# Introduction
In this Part, we will prepare our data for the machine learning work in Part IV. 

An interesting feature of this dataset is that its categorical columns were engineered from the numerical ones. 

For example:
- Friends (Willingness to seek help from friends when students encounter emotional difficulties)
- Friends_bi (Whether students are willing to seek help from friends when they encounter emotional difficulties)

As such, our approach for this dataset is slightly different. We will prepare three DataFrames:
1. A DataFrame that contains only the numerical columns
2. A DataFrame that contains only the dummified variables created from the categorical columns
3. A DataFrame that combines both the numerical and the dummified variables

### Step 1: Import pandas
First, let's import pandas so that we can read the CSV that we saved at the end of Part I.

In [None]:
# Step 1: Import pandas

In [1]:
import pandas as pd

### Step 2: Read the cleaned CSV from Part I
Read the CSV from Part I. You should have 268 rows and 50 columns.
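
A quick way to confirm the load is to look at the DataFrame's shape and dtype counts. The sketch below is only an illustration: it reads a tiny in-memory CSV instead of the real file, the column names are borrowed from this notebook, and the values are made up.

```python
import io
import pandas as pd

# Toy stand-in for Cleaned_File.csv (assumption: values are invented for illustration)
toy_csv = io.StringIO(
    "Region,Age,ToDep,Friends_bi\n"
    "SEA,24.0,0.0,Yes\n"
    "JAP,21.0,8.0,No\n"
    "SA,28.0,2.0,Yes\n"
)
toy_df = pd.read_csv(toy_csv)

# shape is (rows, columns); for the real file we would expect (268, 50)
print(toy_df.shape)

# Count how many columns there are of each dtype
print(toy_df.dtypes.value_counts())
```

On the real DataFrame, the same two calls give an early preview of how many numerical and categorical columns to expect in the later steps.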

In [None]:
# Step 2: Read the cleaned CSV

In [2]:
df = pd.read_csv("Cleaned_File.csv")

### Step 3: Get DataFrame containing numerical columns only
Sound familiar? We're repeating Part II Step 4a: selecting only the columns that have the float64 dtype.

We expect a DataFrame with 268 rows and 26 columns.
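
As a minimal sketch of what this selection does, the toy example below filters a small made-up DataFrame by dtype. The `row_id` column is hypothetical and is there only to show that integer columns are not picked up by `include="float64"`.

```python
import pandas as pd

# Toy data (assumption: values and the row_id column are invented for illustration)
toy_df = pd.DataFrame({
    "Age": [24.0, 28.0, 25.0],        # float64
    "Region": ["SEA", "SEA", "JAP"],  # text
    "row_id": [1, 2, 3],              # integer, hypothetical column
})

# Keep only the float64 columns; text and integer columns are left out
toy_numeric = toy_df.select_dtypes(include="float64")
print(list(toy_numeric.columns))
```

If a dataset stores some numbers as integers, `include="number"` selects every numeric dtype instead.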

In [None]:
# Step 3: Get a DataFrame containing only numbers

In [3]:
df_numeric = df.select_dtypes(include='float64')

### Step 4: Export the numerical DataFrame as CSV
We'll be using this CSV later in Part IV.
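
One detail worth knowing when exporting: by default, `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a toy DataFrame, using in-memory buffers instead of real files. Whether Part IV expects the index column is not stated here, so treat `index=False` as an option rather than a requirement.

```python
import io
import pandas as pd

# Toy data (assumption: values are invented for illustration)
toy_numeric = pd.DataFrame({"Age": [24.0, 28.0], "ToDep": [0.0, 2.0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy_numeric.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)
print(columns_with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy_numeric.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
print(columns_without_index)
```

If the index has already been written to the file, reading it back with `pd.read_csv(..., index_col=0)` restores it as the index instead of a data column.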

In [None]:
# Step 4: Export the numerical DataFrame as CSV

In [4]:
df_numeric.to_csv("Cleaned_Numeric.csv")

### Step 5: Get DataFrame containing categorical columns only
Now that we're done with the numerical columns, let's tackle the categorical ones. 

Get a DataFrame that contains only the categorical columns, which pandas stores here with the object dtype.
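
The same dtype filter works for text columns. In this toy sketch the values are made up, and the string columns are created with an explicit object dtype as a precaution, since some pandas versions may infer a dedicated string dtype for text instead.

```python
import pandas as pd

# Toy data (assumption: values are invented for illustration)
toy_df = pd.DataFrame({
    "Age": [24.0, 28.0, 25.0],
    "Region": pd.Series(["SEA", "SEA", "JAP"], dtype="object"),
    "Friends_bi": pd.Series(["Yes", "Yes", "No"], dtype="object"),
})

# Keep only the object-dtype columns; the float64 column is left out
toy_category = toy_df.select_dtypes(include="object")
print(list(toy_category.columns))
```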

In [None]:
# Step 5: Get a DataFrame that contains only strings

In [5]:
df_category = df.select_dtypes(include='object')

### Step 6: Dummify the entire categorical DataFrame
Now that we have the DataFrame, let's dummify the values to turn the categories into binary features. 

![OnehotEncodingExample.png](attachment:OnehotEncodingExample.png)

We can use the pandas pd.get_dummies function to turn a column, or several columns, into dummies. 

<strong>Make sure that you drop the first dummy column of each categorical variable to avoid redundancy.</strong>

We expect the resulting DataFrame to have:
- 268 rows
- 34 columns
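
To see why dropping the first dummy loses no information, here is a small sketch on made-up data. Passing `dtype=int` is an assumption made so that the output shows 0/1 values like the table in this notebook; recent pandas versions return True/False by default.

```python
import pandas as pd

# Toy data (assumption: values are invented for illustration)
toy_category = pd.DataFrame({
    "Region": ["SEA", "JAP", "SA", "SEA"],
    "Friends_bi": ["Yes", "No", "Yes", "Yes"],
})

# drop_first=True removes the first category of each variable (in sorted order):
# "JAP" for Region and "No" for Friends_bi
toy_dummies = pd.get_dummies(toy_category, drop_first=True, dtype=int)
print(toy_dummies)
```

A row where both `Region_SA` and `Region_SEA` are 0 can only be `JAP`, so a separate `Region_JAP` column would be redundant.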

![DummyDropFirst.png](attachment:DummyDropFirst.png)

<strong>Hint: Set the drop_first parameter to True to avoid redundant dummy columns.</strong>

In [None]:
# Step 6: Dummify your categorical DataFrame

In [7]:
df_category = pd.get_dummies(df_category, drop_first=True)


In [8]:
df_category

Unnamed: 0,inter_dom_Inter,Region_JAP,Region_Others,Region_SA,Region_SEA,Gender_Male,Academic_Under,Stay_Cate_Medium,Stay_Cate_Short,Japanese_cate_High,...,Friends_bi_Yes,Parents_bi_Yes,Relative_bi_Yes,Professional_bi_Yes,Phone_bi_Yes,Doctor_bi_Yes,religion_bi_Yes,Alone_bi_Yes,Others_bi_Yes,Internet_bi_Yes
0,1,0,0,0,1,1,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1,1,0,0,0,1,1,0,0,1,1,...,1,1,0,0,0,0,0,0,0,0
2,1,0,0,0,1,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,1,0,...,1,1,1,1,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,1,0,...,1,1,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,0,1,0,0,0,0,1,0,0,1,...,1,1,0,0,0,0,0,0,0,1
264,0,1,0,0,0,0,1,1,0,0,...,1,1,1,0,0,0,0,0,0,0
265,0,1,0,0,0,0,1,0,1,1,...,1,1,1,1,1,1,0,0,0,0
266,0,1,0,0,0,1,1,0,1,1,...,1,1,1,1,1,1,0,0,0,0


### Step 7: Export the dummified DataFrame as CSV
Export this DataFrame as a CSV as well, for use in Part IV.

In [None]:
# Step 7: Export the DataFrame from Step 6

In [9]:
df_category.to_csv("Cleaned_Objecttype.csv")

### Step 8: Get a DataFrame that is both numerical and dummified
Now that we have a numerical DataFrame, and a dummified DataFrame, let's get a DataFrame that contains both. 

There are two ways to do this:
1. Use pd.get_dummies on your original DataFrame (don't forget to set drop_first=True)
2. Concatenate the two DataFrames from Steps 3 and 6

Both methods lead to the same DataFrame, so either one is fine. Method 1 is simpler, while Method 2 lets you practice combining DataFrames. 

At the end, you will have a DataFrame that has:
- 268 rows
- 60 columns
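
The two methods can be compared side by side on a toy DataFrame. This is a sketch under assumptions: the values are made up, the text columns are given an explicit object dtype, and `dtype=int` is passed so that the dummies are 0/1. It relies on every column being either float64 or object, as in this dataset.

```python
import pandas as pd

# Toy data with numerical and categorical columns interleaved (values are invented)
toy_df = pd.DataFrame({
    "Region": pd.Series(["SEA", "JAP", "SA"], dtype="object"),
    "Age": [24.0, 21.0, 28.0],
    "Friends_bi": pd.Series(["Yes", "No", "Yes"], dtype="object"),
    "ToDep": [0.0, 8.0, 2.0],
})

# Method 1: dummify the original DataFrame directly
method_1 = pd.get_dummies(toy_df, drop_first=True, dtype=int)

# Method 2: split by dtype, dummify the categorical part, then concatenate column-wise
toy_numeric = toy_df.select_dtypes(include="float64")
toy_dummies = pd.get_dummies(
    toy_df.select_dtypes(include="object"), drop_first=True, dtype=int
)
method_2 = pd.concat([toy_numeric, toy_dummies], axis=1)

print(list(method_1.columns))

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(method_1, method_2)
```

In both results the numerical columns come first and the dummy columns follow, which matches the column order of the output shown in this step.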

In [None]:
# Step 8: Get the full dummified DataFrame

In [10]:
final_output = pd.concat([df_numeric, df_category], axis=1)

In [11]:
final_output

Unnamed: 0,Age,Age_cate,Stay,Japanese,English,ToDep,ToSC,APD,AHome,APH,...,Friends_bi_Yes,Parents_bi_Yes,Relative_bi_Yes,Professional_bi_Yes,Phone_bi_Yes,Doctor_bi_Yes,religion_bi_Yes,Alone_bi_Yes,Others_bi_Yes,Internet_bi_Yes
0,24.0,4.0,5.0,3.0,5.0,0.0,34.0,23.0,9.0,11.0,...,1,1,0,0,0,0,0,0,0,0
1,28.0,5.0,1.0,4.0,4.0,2.0,48.0,8.0,7.0,5.0,...,1,1,0,0,0,0,0,0,0,0
2,25.0,4.0,6.0,4.0,4.0,2.0,41.0,13.0,4.0,7.0,...,0,0,0,0,0,0,0,0,0,0
3,29.0,5.0,1.0,2.0,3.0,3.0,37.0,16.0,10.0,10.0,...,1,1,1,1,0,0,0,0,0,0
4,28.0,5.0,1.0,1.0,3.0,3.0,37.0,15.0,12.0,5.0,...,1,1,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,21.0,3.0,4.0,5.0,4.0,8.0,27.0,16.0,9.0,10.0,...,1,1,0,0,0,0,0,0,0,1
264,22.0,3.0,3.0,3.0,4.0,2.0,48.0,8.0,10.0,5.0,...,1,1,1,0,0,0,0,0,0,0
265,19.0,2.0,1.0,5.0,3.0,9.0,47.0,8.0,7.0,5.0,...,1,1,1,1,1,1,0,0,0,0
266,19.0,2.0,1.0,5.0,3.0,1.0,43.0,8.0,12.0,5.0,...,1,1,1,1,1,1,0,0,0,0


### Step 9: Export the final DataFrame as CSV
Now that you have the DataFrame, it's time to export it as our third CSV.

In [None]:
# Step 9: Export the third DataFrame as a CSV

In [12]:
final_output.to_csv("Cleaned_For_ML.csv")

### End of Part III
This dataset is unusual in that most of its features were already engineered for us.

As such, we just needed to split the DataFrame into different parts. 

More specifically, we prepared three DataFrames (numerical only, dummified only, and the two combined) so that we can work with them in the next Part, which covers machine learning modelling.