#### Scikit-Learn (also called sklearn)  
#### Why Scikit-Learn?  
* Built on NumPy and Matplotlib
* Has many built-in ML models  
* Provides methods to evaluate your ML models  
* Very well-designed API  


What we are going to cover:  

0. An end-to-end Scikit-Learn workflow  
1. Getting the data ready  
2. Choosing the right estimator/algorithm for our problem  
3. Fitting the model/algorithm and using it to make predictions on our data
4. Evaluating the model
5. Improving the model  
6. Saving and loading a trained model  
7. Putting it all together  

#### If you are getting warnings, you can silence them with  
import warnings  
warnings.filterwarnings("ignore")
###### Remember that warnings are sometimes useful

#### To reset this  
import warnings  
warnings.filterwarnings("default")
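
If you only want to silence warnings around one noisy call rather than for the whole notebook, a scoped filter is one option. The sketch below uses the standard library's `warnings.catch_warnings()`; `noisy_function` is a hypothetical stand-in for any call that emits a warning.

```python
import warnings

def noisy_function():
    # Hypothetical stand-in for a library call that emits a warning
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Warnings are ignored only inside this with-block
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_function()

# record=True collects warnings in a list instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_function()

print(result, len(caught))
```

When each `with` block ends, the previous warning filters are restored, so nothing needs to be reset by hand.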

### Checking the version of sklearn

In [2]:
import sklearn
sklearn.show_versions()
# or 
# sklearn.__version__


System:
    python: 3.8.11 (default, Aug  6 2021, 09:57:55) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\harsh\anaconda3\envs\tensorflow\python.exe
   machine: Windows-10-10.0.22000-SP0

Python dependencies:
          pip: 21.0.1
   setuptools: 52.0.0.post20210125
      sklearn: 0.24.2
        numpy: 1.20.3
        scipy: 1.7.1
       Cython: 0.29.24
       pandas: 1.3.2
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True


-----------------------------------------------------------------

# 0. Scikit Workflow (end-to-end)

<img src="scikit-images/sklearn-workflow-title.png">

### 1. Getting the data ready

In [3]:
import pandas as pd
import numpy as np

In [4]:
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [49]:
# Create X i.e (feature matrix)
X= heart_disease.drop("target",axis=1)

# Create y i.e (labels)
y= heart_disease["target"]

### 2. Choose the right model and hyperparameters

Here we use classification because we need to predict whether or not a patient has heart disease, i.e. a binary 0/1 target.

In [51]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

A model parameter is a configuration variable that is internal to the model and whose value is estimated from data during training.  
A model hyperparameter is a configuration that is external to the model and whose value is set before training rather than estimated from data.
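
To make the distinction concrete, here is a minimal sketch. It uses synthetic data from `make_classification` instead of the heart disease table, and the names `X_demo`, `y_demo` and `demo_clf` are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the real dataset (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by us before training
demo_clf = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
print(demo_clf.get_params()["n_estimators"])  # the value we passed in

# Parameters: learned from the data during fit()
demo_clf.fit(X_demo, y_demo)
print(len(demo_clf.estimators_))      # one fitted tree per estimator
print(demo_clf.feature_importances_)  # estimated from the training data
```

Attributes that end in an underscore, such as `estimators_`, only exist after `fit()` has been called, which is a handy way to tell learned values from settings.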

In [52]:
# We will keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 3. Fit the model to the training data

In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2)

In [54]:
clf.fit(X_train,y_train);

In [55]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
70,54,1,2,120,258,0,0,147,0,0.4,1,0,3
177,64,1,2,140,335,0,1,158,0,0.0,2,0,2
284,61,1,0,140,207,0,0,138,1,1.9,2,1,3
29,53,1,2,130,197,1,0,152,0,1.2,0,0,2
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
50,51,0,2,130,256,0,0,149,0,0.5,2,0,2
229,64,1,2,125,309,0,1,131,1,1.8,1,0,3
37,54,1,2,150,232,0,0,165,0,1.6,2,0,3
157,35,1,1,122,192,0,1,174,0,0.0,2,0,2


In [56]:
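# Make a prediction: predict() must be called on a 2D feature array
# with the same columns as the training data (here, the test features)
y_label= clf.predict(X_test)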

In [57]:
y_preds= clf.predict(X_test)
y_preds

array([0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0], dtype=int64)

In [58]:
y_test

186    0
101    1
131    1
129    1
106    1
      ..
198    0
242    0
24     1
275    0
235    0
Name: target, Length: 61, dtype: int64

### 4. Evaluate the model on training data and test data

In [62]:
clf.score(X_train, y_train)

1.0

In [63]:
clf.score(X_test,y_test)

0.8852459016393442

In [64]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [65]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.90      0.87      0.88        30
           1       0.88      0.90      0.89        31

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.89        61
weighted avg       0.89      0.89      0.89        61



In [66]:
confusion_matrix(y_test,y_preds)

array([[26,  4],
       [ 3, 28]], dtype=int64)

In [74]:
accuracy_score(y_test,y_preds)

0.8852459016393442
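
The numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix from the output above and derives accuracy, plus precision and recall for class 1; the variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels
cm = np.array([[26, 4],
               [3, 28]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)  # of the patients predicted 1, how many really are 1
recall_1 = tp / (tp + fn)     # of the patients who really are 1, how many were found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_1:.3f}")
print(f"recall:    {recall_1:.3f}")
```

The accuracy is 54 correct predictions out of 61, which matches the value returned by `accuracy_score` and by `clf.score(X_test, y_test)`.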

### 5. Improve a model

In [75]:
# Try different values of n_estimators
np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators...")
    clf= RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
    print(f"Model accuracy on test set: {clf.score(X_test,y_test)*100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 78.69%

Trying model with 20 estimators...
Model accuracy on test set: 85.25%

Trying model with 30 estimators...
Model accuracy on test set: 81.97%

Trying model with 40 estimators...
Model accuracy on test set: 86.89%

Trying model with 50 estimators...
Model accuracy on test set: 83.61%

Trying model with 60 estimators...
Model accuracy on test set: 83.61%

Trying model with 70 estimators...
Model accuracy on test set: 81.97%

Trying model with 80 estimators...
Model accuracy on test set: 86.89%

Trying model with 90 estimators...
Model accuracy on test set: 80.33%



### 6. Save a model and load it

In [5]:
import pickle

Python's pickle module is used for serializing and de-serializing a Python object structure.  
Most Python objects, including trained Scikit-Learn models, can be pickled so that they can be saved to disk.

In [77]:
with open("random_forest_model_1.pkl","wb") as f:
    pickle.dump(clf,f)

In [78]:
with open("random_forest_model_1.pkl","rb") as f:
    loaded_model=pickle.load(f)
loaded_model.score(X_test, y_test)

0.8032786885245902
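
As a self-contained check that a pickled model behaves like the original, here is a minimal sketch. It trains a small model on synthetic data and writes to a temporary directory; `demo_model`, `restored_model` and the file name are made up for this example.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small model, used only for this illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:  # the with-block closes the file for us
        pickle.dump(demo_model, f)
    with open(path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model makes the same predictions as the original
same = bool((restored_model.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.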

---------------------------------------------------------------------------------

# 1. Getting our data ready to be used with machine learning  

Three main things we have to do:  
    1. Split the data into features and labels (usually "X" & "Y")  
    2. Fill (also called imputing) or disregard missing values  
    3. Convert non-numerical values to numerical values (also called feature encoding)  

In [7]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [8]:
heart_disease.shape

(303, 14)

In [9]:
# X is our feature column
X= heart_disease.drop("target",axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [10]:
# Y is our target column
Y= heart_disease["target"]
Y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [11]:
X.shape, Y.shape

((303, 13), (303,))

###### We use the feature columns to predict the target column

In [12]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test, Y_train, Y_test= train_test_split(X,Y,test_size=0.2)

In [13]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((242, 13), (61, 13), (242,), (61,))
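
The shapes above follow directly from `test_size=0.2` applied to 303 rows. The sketch below reproduces them with dummy arrays and also shows the `random_state` argument, which makes the shuffle repeatable; the array names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with the same shapes as the heart disease features and labels
X_demo = np.zeros((303, 13))
y_demo = np.arange(303)

# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# The same random_state selects the same rows for the test set
_, _, _, y_te_again = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print((y_te == y_te_again).all())
```

Without a fixed `random_state` (or a seeded NumPy generator), every run produces a different split, so scores will vary slightly between runs.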

#### Clean Data > Transform Data > Reduce Data

### 1.1 Make sure it's all numerical

In [15]:
car_sales= pd.read_csv("car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [16]:
len(car_sales)

1000

In [17]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [21]:
# Doors is categorical too, just like Make and Colour
car_sales["Doors"].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

In [18]:
# Split into X /Y
X= car_sales.drop("Price",axis=1)
Y=car_sales["Price"]

In [24]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [25]:
Y.head()

0    15323
1    19943
2    28343
3    13434
4    14043
Name: Price, dtype: int64

In [19]:
# Split into training and test
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2)

In [20]:
# Build machine learning model
from sklearn.ensemble import RandomForestRegressor
model= RandomForestRegressor()

model.fit(X_train,Y_train)
model.score(X_test,Y_test)

ValueError: could not convert string to float: 'Nissan'

In [23]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
one_hot= OneHotEncoder()

In [28]:
categorical_features= ["Make","Colour","Doors"]
transformer= ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder= "passthrough")
transformed_X= transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [29]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [30]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


Above, we converted the Make, Colour and Doors columns and kept the Odometer column unchanged.  
#### One Hot Encoding  
* It is a process used to turn categories into numbers.  
* Say we have cars [0, 1, 2, 3] with colours [red, green, blue, red]. Each colour becomes its own column, holding 1 if the car has that colour and 0 otherwise:  

| Car | Colour | Red | Green | Blue |
|-----|--------|-----|-------|------|
| 0 | red | 1 | 0 | 0 |
| 1 | green | 0 | 1 | 0 |
| 2 | blue | 0 | 0 | 1 |
| 3 | red | 1 | 0 | 0 |
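
The four-car example above can be reproduced in a few lines. This is a minimal sketch using `pd.get_dummies`; the `cars` DataFrame is made up to match the example.

```python
import pandas as pd

# The four-car example from above
cars = pd.DataFrame({"Colour": ["red", "green", "blue", "red"]})

# One column per category; astype(int) gives 0/1 regardless of pandas version
encoded = pd.get_dummies(cars["Colour"]).astype(int)
print(encoded)
```

Each row contains exactly one 1, which is where the name "one hot" comes from. Note that pandas orders the new columns alphabetically (blue, green, red).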

In [32]:
# Another way of doing this
# (get_dummies only encodes non-numeric columns, so Doors stays as a number below)
dummies= pd.get_dummies(car_sales[["Make","Colour","Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1


In [33]:
# Let's refit the model
np.random.seed(42)
X_train, X_test, Y_train, Y_test = train_test_split(transformed_X,Y,test_size=0.2)
model.fit(X_train,Y_train)

RandomForestRegressor()

In [34]:
model.score(X_test,Y_test)

0.3235867221569877

## 1.2 What if there were missing values?
1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether

In [38]:
# Import car sales missing data
car_sales_missing= pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [39]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Around 50 values are missing in each column.

In [41]:
# Create X and Y
X= car_sales_missing.drop("Price",axis=1)
Y= car_sales_missing["Price"]

In [43]:
# Let's try to convert the data to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
one_hot= OneHotEncoder()
categorical_features= ["Make","Colour","Doors"]
transformer= ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder= "passthrough")
transformed_X= transformer.fit_transform(X)
transformed_X

<1000x16 sparse matrix of type '<class 'numpy.float64'>'
	with 4000 stored elements in Compressed Sparse Row format>

#### In newer versions of Scikit-Learn (0.24+), the OneHotEncoder class can handle None and NaN values by treating them as their own category, which is why we don't get an error here.
It is still good practice to clean, transform and reduce your data first, since data that is of no use increases the cost and training time of the model.

In [45]:
car_sales_missing["Doors"].value_counts()

4.0    811
5.0     75
3.0     64
Name: Doors, dtype: int64

#### Option 1: Fill missing data with Pandas

In [46]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [48]:
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [49]:
# Check out DataFrame again
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [50]:
# Remove rows with missing price value (as they are of no use for us)
car_sales_missing.dropna(inplace=True)

In [51]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [52]:
len(car_sales_missing)

950

In [54]:
X=car_sales_missing.drop("Price",axis=1)
Y=car_sales_missing["Price"]

In [56]:
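# Let's try to convert the data to numbers
# Note: fit_transform is called on the full DataFrame below, so Price is passed
# through as the last output column; pass X instead to encode only the features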
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
one_hot= OneHotEncoder()
categorical_features= ["Make","Colour","Doors"]
transformer= ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder= "passthrough")
transformed_X= transformer.fit_transform(car_sales_missing)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

-------------------------------------------------------

Once your data is all in numerical format, there's one more transformation you'll probably want to apply.  
##### It's called feature scaling.
In other words, making sure all of your numerical data is on the same scale.   
For example, say you were trying to predict the sale price of cars, and the number of kilometres on their odometers varies from 6,000 to 345,000 while the median previous repair cost varies from 100 to 1,700.   
A machine learning algorithm may have trouble finding patterns in variables with such different ranges.    
To fix this, there are two main types of feature scaling.  
##### Normalization (also called min-max scaling) - 
* This rescales all the numerical values to between 0 and 1, with the lowest value mapped to 0 and the highest value mapped to 1.   
* Scikit-Learn provides functionality for this in the MinMaxScaler class.

##### Standardization - 
* This subtracts the mean value from each feature (so the resulting features have 0 mean).
* It then scales the features to unit variance (by dividing each feature by its standard deviation). 
* Scikit-Learn provides functionality for this in the StandardScaler class.

A couple of things to note.
* Feature scaling usually isn't required for your target variable.
* Feature scaling is usually not required with tree-based models (e.g. Random Forest) since they split on one feature at a time and are insensitive to each feature's scale.
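
Both kinds of scaling can be tried on a tiny array. The sketch below uses made-up odometer and repair-cost values in the ranges mentioned above; it is an illustration, not part of the car sales workflow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up odometer readings (km) and repair costs, one row per car
features = np.array([[6_000.0, 100.0],
                     [120_000.0, 900.0],
                     [345_000.0, 1_700.0]])

normalized = MinMaxScaler().fit_transform(features)      # each column rescaled to [0, 1]
standardized = StandardScaler().fit_transform(features)  # each column: mean 0, unit variance

print(normalized.min(axis=0), normalized.max(axis=0))
print(standardized.mean(axis=0).round(6), standardized.std(axis=0).round(6))
```

In a real project, fit the scaler on the training set only and then call `transform` on the test set, so that the test data does not influence the scaling statistics.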

-----------------------------

#### Option 2: Fill missing values with Scikit-Learn

In [4]:
car_sales_missing=pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [5]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [6]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"],inplace= True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [7]:
# Split into X and Y
X= car_sales_missing.drop("Price",axis=1)
Y= car_sales_missing["Price"]

In [8]:
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

In [9]:
# Fill missing values with Scikit learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [12]:
# Fill categorical values with 'missing' and numerical values with mean
cat_imputer= SimpleImputer(strategy='constant',fill_value='missing')
door_imputer= SimpleImputer(strategy="constant",fill_value=4)
num_imputer= SimpleImputer(strategy="mean")

# Define columns
cat_features= ["Make","Colour"]
door_features= ["Doors"]
num_features= ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer",cat_imputer,cat_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer",num_imputer, num_features)
])

#Transform the data 
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [13]:
car_sales_filled= pd.DataFrame(filled_X,
                               columns=["Make","Colour","Doors","Odometer (KM)"])

In [14]:
car_sales_filled

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0
...,...,...,...,...
945,Toyota,Black,4.0,35820.0
946,missing,White,3.0,155144.0
947,Nissan,Blue,4.0,66604.0
948,Honda,White,4.0,215883.0


In [15]:
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [17]:
# Let's try to convert the data to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
one_hot= OneHotEncoder()
categorical_features= ["Make","Colour","Doors"]
transformer= ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder= "passthrough")
transformed_X= transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [19]:
# Now our data is numeric and has no missing values,
# so let's fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train,X_test, Y_train, Y_test= train_test_split(transformed_X,Y,test_size=0.2)

model= RandomForestRegressor()
model.fit(X_train,Y_train)
model.score(X_test,Y_test)

0.21990196728583944
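
In Option 2 the imputer and encoder are fitted on all of X before the train/test split, so the test rows influence the fill values. One way to avoid this is to chain the same steps in a `Pipeline` and fit it on the training rows only. The sketch below is an illustration under stated assumptions: the `demo` table is randomly generated with the same column names, not the real car sales data, and the pipeline and step names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up table with the same columns as car_sales_missing (not the real data)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n),
    "Colour": rng.choice(["White", "Blue", "Black"], size=n),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n).astype(float),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
})
demo["Price"] = 30_000 - 0.08 * demo["Odometer (KM)"] + rng.normal(0, 1_000, size=n)

# Knock out a few feature values to mimic the missing data
demo.loc[::17, "Make"] = np.nan
demo.loc[::23, "Odometer (KM)"] = np.nan
demo.loc[::29, "Doors"] = np.nan

# Same fill strategies as above, each followed by encoding where needed
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
door_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
num_pipe = Pipeline([("impute", SimpleImputer(strategy="mean"))])

preprocess = ColumnTransformer([
    ("cat", cat_pipe, ["Make", "Colour"]),
    ("door", door_pipe, ["Doors"]),
    ("num", num_pipe, ["Odometer (KM)"]),
])

demo_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

X_demo = demo.drop("Price", axis=1)
y_demo = demo["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fill values and categories are learned from the training rows only
demo_pipeline.fit(X_tr, y_tr)
demo_score = demo_pipeline.score(X_te, y_te)
print(demo_score)
```

Because the preprocessing lives inside the pipeline, calling `demo_pipeline.predict` on new rows applies the same imputation and encoding automatically. The score printed here reflects the synthetic data only and says nothing about the real car sales dataset.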