## Getting the data ready
Data doesn't always come ready to use with a Scikit-Learn machine learning model.
Before building a machine learning model, we have to do the following:

1. Deal with missing values
2. Convert non-numerical variables to numeric (feature encoding)
3. Split the data into X (features/independent variables) and y (target/dependent variable)
4. Split X and y again into training and test sets

In [1]:
#Standard Libraries

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
heart_disease = pd.read_csv('Dataset/heart-disease.csv')
heart_disease.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
X = heart_disease.drop('target', axis=1)
y= heart_disease['target']

In [4]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [5]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((242, 13), (61, 13), (242,), (61,))
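
The split above is random, so the exact rows in each set change from run to run. As a minimal sketch, assuming stand-in data with the same shape as the heart disease features (303 rows, 13 columns; the arrays `X_demo` and `y_demo` are hypothetical), passing `random_state` makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the heart disease features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# random_state fixes the shuffle, so the same rows land in each set on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (242, 13) (61, 13)
```

With `test_size=0.2`, the test set size is rounded up (61 of 303 rows) and the remaining rows go to training.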

Note: This dataset has no missing values and all of its columns are already numeric, so the following sections use a different dataset.

## 1. Handling the missing values

Many machine learning models, including most Scikit-Learn estimators, either raise an error or perform poorly when the data contains missing values.

There are two main options when dealing with missing values.

1. Fill them with some given value. For example, you might fill missing values of a numerical column with the mean of all the other values. The practice of filling missing values is often referred to as imputation.
2. Remove them. If a row has missing values, you may opt to remove it from your sample completely. However, this potentially leaves you with less data to build your model.
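
The two options can be compared on a tiny example. This is only a sketch on hypothetical toy data (the `toy` frame is not taken from any dataset used here):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "odometer": [10000.0, np.nan, 30000.0],
    "price": [5000.0, 7000.0, np.nan],
})

# Option 1: imputation, here filling odometer with the column mean
imputed = toy.fillna({"odometer": toy["odometer"].mean()})

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

print(imputed["odometer"].tolist())  # [10000.0, 20000.0, 30000.0]
print(len(toy), len(dropped))        # 3 1
```

Dropping keeps only fully observed rows, which is why it can shrink a dataset quickly when gaps are spread across many columns.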

**Note:** How to deal with missing values depends on the problem, and there's often no single best way to do it.

In [6]:
car_sales = pd.read_csv('Dataset/car-sales-extended-missing-data.csv')
car_sales.head(5)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [7]:
# First, count the missing values in each column
car_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

**Note:** We can't encode the data or build a model until the missing values have been dealt with.

**Method 1:** Filling missing data with pandas

We'll fill missing categorical values (`Make`, `Colour`) with `"missing"`, missing `Odometer (KM)` values with the column mean, and missing `Doors` values with 4. Rows where `Price` is missing are dropped.

We could fill Price with the mean, however, since it's the target variable, we don't want to be introducing too many fake labels.

**Note:** The practice of filling missing data is called **imputation**. And it's important to remember there's no perfect way to fill missing data. The methods we're using are only one of many. The techniques you use will depend heavily on your dataset. A good place to look would be searching for "data imputation techniques".

In [8]:
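# Assign the result back to the column; calling fillna(inplace=True) on a selected
# column is chained assignment, which recent pandas versions warn about
# Fill the "Make" column
car_sales["Make"] = car_sales["Make"].fillna("missing")

# Fill the "Colour" column
car_sales["Colour"] = car_sales["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with its mean
car_sales["Odometer (KM)"] = car_sales["Odometer (KM)"].fillna(car_sales["Odometer (KM)"].mean())

# Fill the "Doors" column
car_sales["Doors"] = car_sales["Doors"].fillna(4)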

In [9]:
# Check again for na
car_sales.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [10]:
# Remove Rows with missing Price values
car_sales.dropna(inplace=True)

In [11]:
len(car_sales)

950
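
The `dropna()` call above removes any row with a missing value; it only affects `Price` here because the other columns were already filled. A slightly more explicit variant, sketched on hypothetical toy rows, restricts the check to the target column with the `subset` parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: one missing Make, one missing Price
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW"],
    "Price": [15323.0, 19943.0, np.nan],
})

# Only rows whose Price is missing are removed; the missing Make is left for imputation
labelled = toy.dropna(subset=["Price"])

print(len(labelled))                   # 2
print(labelled["Make"].isna().sum())   # 1
```

This way the order of the fill and drop steps no longer matters for which rows survive.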

**Method 2:** Filling missing values with Scikit-Learn

Now that we've filled the missing values using pandas functions, you might be thinking, "Why pandas? I thought this was a Scikit-Learn introduction?".

Not to worry, Scikit-Learn provides a class called [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) which allows us to do a similar thing.

`SimpleImputer()` transforms data by filling missing values with a given strategy.

In [1]:
### Fill missing values with Scikit-Learn
# from sklearn.impute import SimpleImputer
# from sklearn.compose import ColumnTransformer

### Fill categorical values with "missing", Doors with 4 and numerical values with the mean
# cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
# door_imputer = SimpleImputer(strategy="constant", fill_value=4)
# num_imputer = SimpleImputer(strategy="mean")

### Define columns
# cat_features = ["Make", "Colour"]
# door_feature = ["Doors"]
# num_feature = ["Odometer (KM)"]

### Create an imputer (something that fills missing data)
# imputer = ColumnTransformer([
#     ("cat_imputer", cat_imputer, cat_features),
#     ("door_imputer", door_imputer, door_feature),
#     ("num_imputer", num_imputer, num_feature)
# ])

### Transform the data (X is assumed to be the car sales feature DataFrame, without "Price")
# filled_x = imputer.fit_transform(X)

# filled_x

**Note:** We use `fit_transform()` on the training data and `transform()` on the testing data. In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). Then we take those same patterns and fill the test set (transform only).
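
The fit-then-transform pattern from the note can be sketched as follows. The toy rows and the names `X_train_toy` and `X_test_toy` are hypothetical; the imputers mirror the ones in the commented-out cell above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy rows that mimic the car sales feature columns
toy = pd.DataFrame({
    "Make": pd.Series(["Honda", "BMW", np.nan, "Toyota", "Nissan", "Honda", np.nan, "BMW"],
                      dtype=object),
    "Doors": [4.0, 5.0, np.nan, 4.0, 3.0, np.nan, 4.0, 4.0],
    "Odometer (KM)": [35431.0, np.nan, 84714.0, 154365.0, np.nan, 42652.0, 163453.0, 43120.0],
})

# Split first so the test rows never influence the learned fill values
X_train_toy, X_test_toy = train_test_split(toy, test_size=0.25, random_state=0)

imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make"]),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

# fit_transform learns the Odometer mean from the training rows only;
# transform reuses that same mean on the test rows
columns = ["Make", "Doors", "Odometer (KM)"]
filled_train = pd.DataFrame(imputer.fit_transform(X_train_toy), columns=columns)
filled_test = pd.DataFrame(imputer.transform(X_test_toy), columns=columns)

print(filled_train.isna().sum().sum(), filled_test.isna().sum().sum())  # 0 0
```

The output columns follow the order in which the transformers are listed, which is why `columns` is spelled out explicitly.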

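## 2. Converting non-numeric variables to numeric

Scikit-Learn models expect numeric input, so non-numeric columns such as `Make` and `Colour` have to be encoded as numbers before we can build a model.

**Method 1:** One-hot encoding with Scikit-Learn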

In [12]:
#Split the dataset into X and y
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

In [18]:
# Let's transform the data now that we don't have any missing values
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)],
                                remainder="passthrough")

transformed_x = transformer.fit_transform(car_sales)
transformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [19]:
pd.DataFrame(transformed_x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,35431.0,15323.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0,19943.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,84714.0,28343.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,154365.0,13434.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0,14043.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0,32042.0
946,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,155144.0,5716.0
947,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0,31570.0
948,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,215883.0,4001.0


In [21]:
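# Now our data is numeric and filled (no missing values)
# Caution: transformed_x was built from car_sales, which still includes Price,
# so the target leaks into the features and the score below is inflated.
# Fitting the transformer on X instead avoids this.
# Let's fit a model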
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_x,y, test_size=0.2)
model = RandomForestRegressor()

model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9998421058539826
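
The near-perfect score above is worth a second look: the transformer was fit on `car_sales`, which still contains `Price` (it appears as the last column of `transformed_x`), so the model sees the target among its features. A minimal sketch of encoding the features only, on hypothetical toy rows (`X_toy` and `y_toy` are illustrative names):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows shaped like the car sales data
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Colour": ["White", "Blue", "White", "White"],
    "Doors": [4.0, 5.0, 4.0, 4.0],
    "Odometer (KM)": [35431.0, 192714.0, 84714.0, 154365.0],
    "Price": [15323.0, 19943.0, 28343.0, 13434.0],
})

# Separate the target before encoding so it cannot leak into the features
X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = transformer.fit_transform(X_toy)

# 3 makes + 2 colours + 2 door counts + the passthrough odometer column
print(encoded.shape)  # (4, 8)
```

On the real data, a model trained this way should give a more honest, and most likely lower, score than the one shown above.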

**Method 2:** One-hot encoding with pandas

In [None]:
# Another way: one-hot encode with pandas and pd.get_dummies()
car_sales.head()

# Doors is numeric here, so get_dummies() passes it through unencoded
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

# Convert Doors to object so get_dummies() treats it as categorical
car_sales["Doors"] = car_sales["Doors"].astype(object)
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

# The dummy columns now hold only 1/0 (True/False in recent pandas versions);
# compare them with the original category counts
X["Make"].value_counts()

# Let's refit the model on the one-hot encoded array from Method 1
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_x,
                                                    y,
                                                    test_size=0.2)

model.fit(X_train, y_train)

model.score(X_test, y_test)
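
As an alternative to converting `Doors` to `object`, `pd.get_dummies()` also accepts a `columns` argument that names the columns to encode regardless of dtype. A small sketch on hypothetical toy rows, using `dtype=int` so the result is 0/1 across pandas versions:

```python
import pandas as pd

# Hypothetical toy rows: Make is text, Doors is numeric
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Doors": [4.0, 5.0, 4.0],
})

# Default behaviour: only the text column is encoded, Doors passes through as a number
as_is = pd.get_dummies(toy, dtype=int)

# Naming the columns explicitly encodes Doors as well, without changing its dtype first
explicit = pd.get_dummies(toy, columns=["Make", "Doors"], dtype=int)

print(as_is.shape, explicit.shape)  # (3, 3) (3, 4)
```

Either route yields the same kind of one-hot columns; the explicit form just avoids modifying `car_sales` in place.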