# Hands-On Exercise 2.3:
# Transforming Data in Python in Preparation for Analysis
***

## Objectives

### In this exercise, you will familiarize yourself with Python syntax for transforming data sets in preparation for analysis.

### Overview
In this exercise, you will use Python commands to rescale data, detect and deal with
missing values, bin numeric data into ranges, convert between data types, combine data sets both vertically and horizontally, and, finally, write your own functions as well as use some built-in functions.<br><br>

**Pre-step: Execute the following cell in order to suppress warning messages**

```python
import warnings
warnings.filterwarnings("ignore")
```

**Major Step: Transforming data sets**

1. ❏ Import the **pandas** and **statsmodels.api** libraries, using the aliases **pd** and **sm**

2. ❏ Import a dataset called **mtcars** and preview it<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Syntax:* &nbsp;&nbsp;&nbsp; ***mtcars = sm.datasets.get_rdataset("mtcars").data***

3. ❏ Import **preprocessing** from **sklearn**

4. ❏ Scaling features to lie between a given minimum and maximum value (e.g., 0 and 1) can be achieved using a MinMaxScaler. Use the following syntax to rescale the first four columns of the mtcars data set using the **.MinMaxScaler()** class<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;***scaler = preprocessing.MinMaxScaler()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MinMaxScaled = pd.DataFrame(scaler.fit_transform(mtcars.iloc[:,[0,1,2,3]]), columns=mtcars.columns[0:4])***

5. ❏ Use the **.describe()** method to view the distribution of data in the scaled columns
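
The effect of min-max scaling can be checked on a small example. The sketch below uses a made-up two-column frame (`toy` and its values are hypothetical and are not the mtcars data) and confirms that every scaled column runs from 0 to 1.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.7, 14.3],
                    "hp": [110, 93, 175, 245]})

scaler = preprocessing.MinMaxScaler()  # default feature_range is (0, 1)
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# In the summary, each column's min should be 0 and its max should be 1
print(toy_scaled.describe().loc[["min", "max"]])
```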

6. ❏ Normalization rescales each row so that it has a unit norm (a length of 1); for non-negative data such as mtcars, the resulting values lie between 0 and 1. Rescale the first four columns of the mtcars data set using the **.normalize()** function<br><br>
*Info: **.normalize()** is a row-wise operation, whereas **.scale()** in the next step is performed column-wise*

7. ❏ Shift the distribution of each of these attributes so that it has a mean of zero and a standard deviation of one (unit variance).<br><br>
*Hint: Use the **.scale()** function rather than .normalize()*
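
The row-wise versus column-wise distinction is easier to see on a tiny example. The following is a minimal sketch on a made-up frame (`toy` is hypothetical, not mtcars): after `normalize()` every row has length 1, and after `scale()` every column has mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({"a": [3.0, 6.0, 1.0], "b": [4.0, 8.0, 1.0]})

# normalize() works row by row: each row is divided by its own L2 norm
row_normalized = preprocessing.normalize(toy)
row_norms = np.linalg.norm(row_normalized, axis=1)

# scale() works column by column: subtract the column mean,
# then divide by the column standard deviation
col_scaled = preprocessing.scale(toy)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)

print(row_norms)             # one norm per row
print(col_means, col_stds)   # one mean and one std per column
```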

8. ❏ Load the **titanic** dataset into a pandas data frame from the file **titanic.csv**

9. ❏ View the 25th through 29th rows of the **Age** column

10. ❏ Use the **isnull()** method to check for missing values.

11. ❏ Calculate the **mean** of the Age column

12. ❏ Replace missing values in the Age column with the mean of that column
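
The detect, summarize, and fill pattern from the last three steps can be sketched on a small made-up column. The frame `people` below is a hypothetical stand-in and does not reproduce the contents of titanic.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Age column; the values are illustrative only
people = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})

missing_mask = people["Age"].isnull()   # True where a value is missing
n_missing = missing_mask.sum()          # count of missing ages

age_mean = people["Age"].mean()         # mean() skips NaN by default

# Replace each missing age with the column mean
people["Age"] = people["Age"].fillna(age_mean)
print(n_missing, age_mean)
print(people)
```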

13. ❏ Add a new column called GenderCode, which encodes the Sex column as integer codes using **pd.Categorical()**<br><br>
*Hint: The integer codes are available through the **.codes** attribute*

14. ❏ View the two columns **Sex** and **GenderCode**
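
As an illustration of categorical encoding, the sketch below builds a small made-up `Sex` column (hypothetical data, not the titanic file) and derives integer codes from it. When categories are inferred from strings, pandas sorts them, so `"female"` maps to 0 and `"male"` to 1.

```python
import pandas as pd

# Hypothetical stand-in for the Sex column
people = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# pd.Categorical infers the sorted categories; .codes gives their integer positions
people["GenderCode"] = pd.Categorical(people["Sex"]).codes

print(people[["Sex", "GenderCode"]])
```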

15. ❏ Create a data frame with the following values:

![image.png](attachment:image.png)

16. ❏ Examine the data types of the data frame.

17. ❏ Modify the **B** and **C** columns to be floats using the **.astype()** method and reexamine the data types
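
A type conversion of this kind might look like the sketch below. The frame and its values are assumptions made for illustration, since the values from the exercise image are not reproduced here; column `B` starts as integers and column `C` as numeric strings.

```python
import pandas as pd

# Hypothetical frame; the values are illustrative only
df = pd.DataFrame({"A": ["x", "y", "z"],
                   "B": [1, 2, 3],
                   "C": ["4.5", "5", "6.25"]})
print(df.dtypes)   # B is an integer column, C holds strings

# astype() accepts a mapping from column name to target type
df = df.astype({"B": float, "C": float})
print(df.dtypes)   # B and C are now float64
```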

18. ❏ Read in the two data sets, data1.csv and data2.csv, and bind them together by row.
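
Row-wise combination can be sketched with `pd.concat`. The two frames below are hypothetical stand-ins for whatever `pd.read_csv` would return for data1.csv and data2.csv; their columns and values are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the contents of the two files
data1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
data2 = pd.DataFrame({"id": [3, 4], "value": [30, 40]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them from 0
combined = pd.concat([data1, data2], ignore_index=True)
print(combined)
```

Passing `axis=1` instead would place the frames side by side, which is the horizontal counterpart.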

19. ❏ Write and execute a function that prints your name.

20. ❏ Write a function that accepts a value as an input parameter and returns its
squared value.
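
The two function steps can be sketched as follows. The function names and the placeholder name `"Ada"` are illustrative choices, not part of the exercise.

```python
def print_name(name="Ada"):
    """Print a name; "Ada" is only a placeholder default."""
    print(f"My name is {name}")


def square(value):
    """Return the square of the input value."""
    return value ** 2


print_name()
print(square(4))
```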

21. ❏ Perform a **principal component analysis (PCA)** on the iris data set<br><br>
*Hint: Exclude the Species attribute*

22. ❏ View the proportion of the total variance explained by each component, using the **.explained_variance_ratio_** attribute

23. ❏ View the coefficients of the new variables (the principal components)
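
One possible shape for the PCA steps is sketched below. It assumes the copy of iris bundled with scikit-learn (`load_iris`), which may differ from the way the exercise intends the data to be loaded; the species labels live in `iris.target` and are left out of the fit.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
features = iris.data   # four numeric measurements; species labels are excluded

pca = PCA()            # with no arguments, all four components are kept
pca.fit(features)

# Share of the total variance explained by each component
print(pca.explained_variance_ratio_)

# Coefficients: one row per component, one column per original feature
print(pca.components_)
```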

## <center>**Congratulations! You have successfully carried out some transformations on various data sets within Python.**</center>

![image.png](attachment:image.png)

# <center>**This is the end of the exercise.**</center>