# Introduction to Scikit learn (sklearn)

### This notebook demonstrates:
#### Covering:
1. AN end-to-end Scikit Learn workflow
2. Getting the data ready
3. Choosing the right estimator/algorithm for our problems
4. Fit the model/algo and use it to make predictions on our data
5. Evaluting a model 
6. Improve a model
7. save and load a trained model 
8. putTing it all together

## 1. An end-to-end Scikit Learn workflow

In [1]:
# 1. Getting the data ready
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [2]:
# create X (features matrix)

x = heart_disease.drop("target",axis = 1)
x

#create Y (labels)
y = heart_disease["target"]
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

In [3]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier #RandomforestClassification is classsification ML Model(capable of learning the patterns and data then classify whether the data is that or not)

clf = RandomForestClassifier()

# We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [4]:
# 3. Fit the model to traning data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y ,test_size = 0.2) #test data will be 20%(0.2) and training data will be 80%
                                                                          #we can change the test_size as our own          

In [5]:
clf.fit(x_train,y_train)
x_test



Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
243,57,1,0,152,274,0,1,88,1,1.2,1,1,3
186,60,1,0,130,253,0,1,144,1,1.4,2,1,3
226,62,1,1,120,281,0,0,103,0,1.4,1,1,3
125,34,0,1,118,210,0,1,192,0,0.7,2,0,2
21,44,1,2,130,233,0,1,179,1,0.4,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
209,59,1,0,140,177,0,1,162,1,0.0,2,1,3
45,52,1,1,120,325,0,1,172,0,0.2,2,0,2
77,59,1,1,140,221,0,1,164,1,0.0,2,0,2
188,50,1,2,140,233,0,1,163,0,0.6,1,1,3


In [6]:
y_preds = clf.predict(x_test) # x_test upor vitti kore classifyy korteche row wise je heart disease ache kina

In [7]:
y_preds

array([0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0], dtype=int64)

In [8]:
y_test

243    0
186    0
226    0
125    1
21     1
      ..
209    0
45     1
77     1
188    0
168    0
Name: target, Length: 61, dtype: int64

In [9]:
# 4. Evaluate the model on the training data and the test data
clf.score(x_train,y_train)

1.0

In [10]:
clf.score(x_test,y_test)

0.8360655737704918

In [11]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
print(classification_report(y_test, y_preds)) #ekhane y_test ar y_preds shathe compare korteche

              precision    recall  f1-score   support

           0       0.88      0.75      0.81        28
           1       0.81      0.91      0.86        33

    accuracy                           0.84        61
   macro avg       0.84      0.83      0.83        61
weighted avg       0.84      0.84      0.83        61



In [12]:
confusion_matrix(y_test ,y_preds)

array([[21,  7],
       [ 3, 30]], dtype=int64)

In [13]:
accuracy_score(y_test,y_preds)

0.8360655737704918

In [14]:
# 5. Improve  a model 
# Try different amount of n_estimators
#np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train,y_train)
    print(f"Model accuracy on test set : {clf.score(x_test,y_test)*100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set : 86.89%

Trying model with 20 estimators...
Model accuracy on test set : 83.61%

Trying model with 30 estimators...
Model accuracy on test set : 83.61%

Trying model with 40 estimators...
Model accuracy on test set : 83.61%

Trying model with 50 estimators...
Model accuracy on test set : 81.97%

Trying model with 60 estimators...
Model accuracy on test set : 81.97%

Trying model with 70 estimators...
Model accuracy on test set : 80.33%

Trying model with 80 estimators...
Model accuracy on test set : 80.33%

Trying model with 90 estimators...
Model accuracy on test set : 81.97%



In [15]:
# 6. save the model and load it
import pickle
pickle.dump(clf, open("random_forest_model_1.pkl","wb"))


In [16]:
loaded_model = pickle.load(open("random_forest_model_1.pkl", "rb"))
loaded_model.score(x_test,y_test)

0.819672131147541

In [17]:
loaded_model.score(x_test,y_test)

0.819672131147541

In [18]:
import sklearn
sklearn.show_versions()


System:
    python: 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\USER\anaconda3\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
          pip: 21.2.4
   setuptools: 61.2.0
      sklearn: 1.0.2
        numpy: 1.21.5
        scipy: 1.7.3
       Cython: 0.29.28
       pandas: 1.4.2
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True




## 1.Getting the data ready to be used with machine learning
 ### Three main things we have to do :
     ### 1.Split the data into features and labels (usually "X" & "y")
     ### 2.Filling (also called imputing) or disregarding missing values
     ### 3.Converting non-numerical values to numerical values (also called features encoding)

In [19]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [20]:
## 1. spilt the data
x = heart_disease.drop("target",axis = 1) 
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
y = heart_disease["target"]

In [22]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [23]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size=.2)

In [24]:
X_train.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
198,62,1,0,120,267,0,1,99,1,1.8,1,2,3
128,52,0,2,136,196,0,0,169,0,0.1,1,0,2
294,44,1,0,120,169,0,1,144,1,2.8,0,0,1
249,69,1,2,140,254,0,0,146,0,2.0,1,3,3
134,41,0,1,126,306,0,1,163,0,0.0,2,0,2


In [25]:
X_train.shape, X_test.shape ,Y_train.shape,Y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [26]:
x.shape

(303, 13)

In [27]:
len(heart_disease)

303

# 1.1 Make sure it's all numerical

In [28]:
car_sales = pd.read_csv("car-sales.csv")
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


In [29]:
car_sales["Price"] = car_sales["Price"].str.replace('[\$\,\.]', '')
car_sales

  car_sales["Price"] = car_sales["Price"].str.replace('[\$\,\.]', '')


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,400000
1,Honda,Red,87899,4,500000
2,Toyota,Blue,32549,3,700000
3,BMW,Black,11179,5,2200000
4,Nissan,White,213095,4,350000
5,Toyota,Green,99213,4,450000
6,Honda,Blue,45698,4,750000
7,Honda,Blue,54738,4,700000
8,Toyota,White,60000,4,625000
9,Nissan,White,31600,4,970000


In [30]:
car_sales["Price"] = car_sales["Price"].str[:-2]
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,4000
1,Honda,Red,87899,4,5000
2,Toyota,Blue,32549,3,7000
3,BMW,Black,11179,5,22000
4,Nissan,White,213095,4,3500
5,Toyota,Green,99213,4,4500
6,Honda,Blue,45698,4,7500
7,Honda,Blue,54738,4,7000
8,Toyota,White,60000,4,6250
9,Nissan,White,31600,4,9700


In [31]:
car_sales["Price"] = car_sales["Price"].astype(int)

In [32]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int32
dtype: object

# Split into x/y

In [33]:
x = car_sales.drop("Price",axis=1)
y = car_sales["Price"]


In [34]:
# Split into training and test
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.2)

In [35]:
# Build machine learning model
from sklearn.ensemble import RandomForestRegressor ## RandomForestRegressor diye number guess hobe like price guess kora

model = RandomForestRegressor()
model.fit(x_train,y_train)
model.score(x_test,y_test)

ValueError: could not convert string to float: 'Toyota'

In [36]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder = "passthrough")

transformed_x = transformer.fit_transform(x)
transformed_x

array([[0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 1.50043e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 8.78990e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
        0.00000e+00, 0.00000e+00, 3.25490e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
        0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 1.00000e+00, 1.11790e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 2.13095e+05],
       [0.00000e+00, 0.00000e+

In [37]:
pd.DataFrame(transformed_x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,150043.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,87899.0
2,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,32549.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,11179.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,213095.0
5,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,99213.0
6,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,45698.0
7,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,54738.0
8,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,60000.0
9,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,31600.0


In [38]:
dummies = pd.get_dummies(car_sales[["Make", "Colour","Doors"]])
dummies


Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,0,0,1,0,0,0,0,1
1,4,0,1,0,0,0,0,0,1,0
2,3,0,0,0,1,0,1,0,0,0
3,5,1,0,0,0,1,0,0,0,0
4,4,0,0,1,0,0,0,0,0,1
5,4,0,0,0,1,0,0,1,0,0
6,4,0,1,0,0,0,1,0,0,0
7,4,0,1,0,0,0,1,0,0,0
8,4,0,0,0,1,0,0,0,0,1
9,4,0,0,1,0,0,0,0,0,1


In [39]:
# Lets refit the model 
np.random.seed(42)
x_train,x_test,y_train,y_test = train_test_split(transformed_x,y, test_size=0.2)

In [40]:
model.fit(x_train,y_train)

RandomForestRegressor()

In [41]:
model.score(x_test,y_test)

-1.2793638399999998

## What if there were missing valeus?


In [42]:
car_sales_missing_data = pd.read_csv("car-sales-missing-data.csv")

In [43]:
car_sales_missing_data

Unnamed: 0,Make,Colour,Odometer,Doors,Price
0,Toyota,White,150043.0,4.0,"$4,000"
1,Honda,Red,87899.0,4.0,"$5,000"
2,Toyota,Blue,,3.0,"$7,000"
3,BMW,Black,11179.0,5.0,"$22,000"
4,Nissan,White,213095.0,4.0,"$3,500"
5,Toyota,Green,,4.0,"$4,500"
6,Honda,,,4.0,"$7,500"
7,Honda,Blue,,4.0,
8,Toyota,White,60000.0,,
9,,White,31600.0,4.0,"$9,700"


In [44]:
car_sales_missing_data.isna().sum() #shows us how many missing values are there

Make        1
Colour      1
Odometer    4
Doors       1
Price       2
dtype: int64

In [45]:
# create x and y
x = car_sales_missing_data.drop("Price", axis = 1)
y = car_sales_missing_data["Price"]


In [46]:
# lets try and convert our date to numbers

# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder = "passthrough")

transformed_x = transformer.fit_transform(x)
transformed_x


<10x16 sparse matrix of type '<class 'numpy.float64'>'
	with 40 stored elements in Compressed Sparse Row format>

In [47]:
# print(sklearn.__version__)

In [48]:
car_sales_missing_data["Doors"].value_counts()

4.0    7
3.0    1
5.0    1
Name: Doors, dtype: int64

#### 1.Fill missing data with pandas

In [49]:
# numerical column gulo mean diye fill up korle valo hoy

#fill the "make" column 
car_sales_missing_data["Make"].fillna("missing",inplace = True)

# # fill the "colour" column
car_sales_missing_data["Colour"].fillna("missing",inplace=True)

# # fill the "Odometer (Km)" column
car_sales_missing_data["Odometer"].fillna(car_sales_missing_data["Odometer"].mean(),inplace=True)

#fill the "Doors" column
car_sales_missing_data["Doors"].fillna(4,inplace=True)


In [50]:
car_sales_missing_data

Unnamed: 0,Make,Colour,Odometer,Doors,Price
0,Toyota,White,150043.0,4.0,"$4,000"
1,Honda,Red,87899.0,4.0,"$5,000"
2,Toyota,Blue,92302.666667,3.0,"$7,000"
3,BMW,Black,11179.0,5.0,"$22,000"
4,Nissan,White,213095.0,4.0,"$3,500"
5,Toyota,Green,92302.666667,4.0,"$4,500"
6,Honda,missing,92302.666667,4.0,"$7,500"
7,Honda,Blue,92302.666667,4.0,
8,Toyota,White,60000.0,4.0,
9,missing,White,31600.0,4.0,"$9,700"


In [51]:
car_sales_missing_data.isna().sum()

Make        0
Colour      0
Odometer    0
Doors       0
Price       2
dtype: int64

In [52]:
# Remove rows with missing price value
car_sales_missing_data.dropna(inplace=True)

In [53]:
car_sales_missing_data.isna().sum()

Make        0
Colour      0
Odometer    0
Doors       0
Price       0
dtype: int64

In [54]:
len(car_sales_missing_data)

8

In [55]:
x = car_sales_missing_data.drop("Price", axis = 1)
y = car_sales_missing_data["Price"]

In [56]:
# lets try and convert our date to numbers

# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder = "passthrough")

transformed_x = transformer.fit_transform(car_sales_missing_data)
transformed_x



array([[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0,
        0.0, 150043.0, '$4,000'],
       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0,
        0.0, 87899.0, '$5,000'],
       [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,
        0.0, 92302.66666666667, '$7,000'],
       [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
        1.0, 11179.0, '$22,000'],
       [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0,
        0.0, 213095.0, '$3,500'],
       [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0,
        0.0, 92302.66666666667, '$4,500'],
       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0,
        0.0, 92302.66666666667, '$7,500'],
       [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0,
        0.0, 31600.0, '$9,700']], dtype=object)

In [57]:
car_sales_missing_data

Unnamed: 0,Make,Colour,Odometer,Doors,Price
0,Toyota,White,150043.0,4.0,"$4,000"
1,Honda,Red,87899.0,4.0,"$5,000"
2,Toyota,Blue,92302.666667,3.0,"$7,000"
3,BMW,Black,11179.0,5.0,"$22,000"
4,Nissan,White,213095.0,4.0,"$3,500"
5,Toyota,Green,92302.666667,4.0,"$4,500"
6,Honda,missing,92302.666667,4.0,"$7,500"
9,missing,White,31600.0,4.0,"$9,700"


# 2. Fill missing values with Scikit-Learn

In [58]:
car_sales_missing_data = pd.read_csv("car-sales-missing-data.csv")
car_sales_missing_data.head()

Unnamed: 0,Make,Colour,Odometer,Doors,Price
0,Toyota,White,150043.0,4.0,"$4,000"
1,Honda,Red,87899.0,4.0,"$5,000"
2,Toyota,Blue,,3.0,"$7,000"
3,BMW,Black,11179.0,5.0,"$22,000"
4,Nissan,White,213095.0,4.0,"$3,500"


In [59]:
car_sales_missing_data.isna().sum()

Make        1
Colour      1
Odometer    4
Doors       1
Price       2
dtype: int64

In [60]:
# car_sales_missing_data.dropna(subset=[""])
car_sales_missing_data

Unnamed: 0,Make,Colour,Odometer,Doors,Price
0,Toyota,White,150043.0,4.0,"$4,000"
1,Honda,Red,87899.0,4.0,"$5,000"
2,Toyota,Blue,,3.0,"$7,000"
3,BMW,Black,11179.0,5.0,"$22,000"
4,Nissan,White,213095.0,4.0,"$3,500"
5,Toyota,Green,,4.0,"$4,500"
6,Honda,,,4.0,"$7,500"
7,Honda,Blue,,4.0,
8,Toyota,White,60000.0,,
9,,White,31600.0,4.0,"$9,700"


In [61]:
# Drop the rows with no labels
car_sales_missing_data.dropna(subset=["Price"],inplace=True)
car_sales_missing_data.isna().sum()
car_sales_missing_data["Price"] = car_sales_missing_data["Price"].str.replace('[\$\,\.]', '')

  car_sales_missing_data["Price"] = car_sales_missing_data["Price"].str.replace('[\$\,\.]', '')


In [62]:
car_sales_missing_data["Price"] = car_sales_missing_data["Price"].str[:-2]
car_sales_missing_data

Unnamed: 0,Make,Colour,Odometer,Doors,Price
0,Toyota,White,150043.0,4.0,40
1,Honda,Red,87899.0,4.0,50
2,Toyota,Blue,,3.0,70
3,BMW,Black,11179.0,5.0,220
4,Nissan,White,213095.0,4.0,35
5,Toyota,Green,,4.0,45
6,Honda,,,4.0,75
9,,White,31600.0,4.0,97


In [63]:
# type(car_sales_missing_data["Price"][0])

car_sales_missing_data["Price"] = car_sales_missing_data["Price"].astype(int)
car_sales_missing_data

type(car_sales_missing_data["Price"][0])

numpy.int32

In [64]:
#split into x,y
x= car_sales_missing_data.drop("Price",axis = 1)
y = car_sales_missing_data["Price"]



In [65]:
# Fill missing values with scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with mean
cat_imputer = SimpleImputer(strategy="constant",fill_value="missing")
door_imputer = SimpleImputer(strategy="constant",fill_value=4)
num_imputer = SimpleImputer(strategy="mean")


# Define columns
cat_features = ["Make","Colour"]
door_features = ["Doors"]
num_features = ["Odometer"]

#Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer" , cat_imputer, cat_features),
    ("door_imputer" , door_imputer,door_features),
    ("num_imputer", num_imputer, num_features)
])

#Transform the data
filled_x = imputer.fit_transform(x)
filled_x

array([['Toyota', 'White', 4.0, 150043.0],
       ['Honda', 'Red', 4.0, 87899.0],
       ['Toyota', 'Blue', 3.0, 98763.2],
       ['BMW', 'Black', 5.0, 11179.0],
       ['Nissan', 'White', 4.0, 213095.0],
       ['Toyota', 'Green', 4.0, 98763.2],
       ['Honda', 'missing', 4.0, 98763.2],
       ['missing', 'White', 4.0, 31600.0]], dtype=object)

In [66]:
car_sales_filled = pd.DataFrame(filled_x,columns=["Make","colours","Doors","Odometer"])

In [67]:
car_sales_filled

Unnamed: 0,Make,colours,Doors,Odometer
0,Toyota,White,4.0,150043.0
1,Honda,Red,4.0,87899.0
2,Toyota,Blue,3.0,98763.2
3,BMW,Black,5.0,11179.0
4,Nissan,White,4.0,213095.0
5,Toyota,Green,4.0,98763.2
6,Honda,missing,4.0,98763.2
7,missing,White,4.0,31600.0


In [68]:
car_sales_filled.isna().sum()

Make        0
colours     0
Doors       0
Odometer    0
dtype: int64

In [69]:
# lets try and convert our date to numbers

# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","colours","Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder = "passthrough")

transformed_x = transformer.fit_transform(car_sales_filled)
transformed_x

<8x15 sparse matrix of type '<class 'numpy.float64'>'
	with 32 stored elements in Compressed Sparse Row format>

In [70]:
# Now we've got our data as numbers and filled (no missing values)
#lets fit a model
# np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(transformed_x,y,test_size = .2)
model = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test,y_test)

0.1615836734693874

# 2. choosing the right estimator/algorithm for your problem

### Some things to note:

   ### 1. Sklearn refers to machine learning models, algorithms as estimators.
   ### 2. Classification problem - predecting a category (heart disease or not)
       ## 2.1 Sometimes you'll see "clf" (short for classifier) used as a classification estimator
   ### 3.Regrassion problem - predecting a number (selling price of a car)
   
   
   If you are working on a machine learing problem and looking to use Sklearn and not sure what model you should use, refer to the sklearn machine learning map : https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

# 2.1 Picking a machine learning model for a regrassion problem
 ## Lets use the Califonia Housing dataset. link - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing

In [71]:
# Get califonia Housing Dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [72]:
housing_df = pd.DataFrame(housing["data"],columns=housing["feature_names"])

In [73]:
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [74]:
housing_df["MedHouseVal"] = housing["target"]

In [75]:
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [76]:
# Import algorithm
from sklearn.linear_model import Ridge
# setup random seed
np.random.seed(42)

#Create the data
x = housing_df.drop("MedHouseVal",axis = 1)
y = housing_df["MedHouseVal"] # median house price in $100000

# Split into train and test sets

x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=.2)

# Instantiate and fit the model (on the training set)

model = Ridge()
model.fit(x_train, y_train)

#cheak the score of the model (on the test set)
model.score(x_test,y_test)

0.5758549611440126

## What if "Ridge" didn't work or the score didn't fit our need ?
###  well, we could always try a different model

### How about we try an ensemble model (an ensemble is combination od smaller models to try and make better precdiction than just a single model

### Sklearn ensemble models can be found here : https://scikit-learn.org/stable/modules/ensemble.html

In [77]:
# Import the RandomForestRegrassor model class from the ensemble module

from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
 
x = housing_df.drop("MedHouseVal",axis = 1)
y = housing_df["MedHouseVal"]


# Split into train and test sets

x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=.2)

# Create random forest model(on the training set)

model = RandomForestRegressor()
model.fit(x_train, y_train)

#cheak the score of the model (on the test set)
model.score(x_test,y_test)

0.8065734772187598

## 2.2 Picking an estimator for a classification problem


In [78]:
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [79]:
len(heart_disease)

303

### consulting the map and it says ti try LinearSVC

In [83]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# Setup random seed
np.random.seed(42)

# Make the data 
X = heart_disease.drop("target",axis = 1)
Y = heart_disease["target"]


# split the data 
x_train,x_test,y_train,y_test = train_test_split(X,Y, test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC()
clf.fit(x_train,y_train)

# Evaluate the LinearSVC
clf.score(x_test,y_test)



0.8688524590163934

In [84]:
# Import the ranndomforestClassigfier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data 
X = heart_disease.drop("target",axis = 1)
Y = heart_disease["target"]


# split the data 
x_train,x_test,y_train,y_test = train_test_split(X,Y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(x_train,y_train)

# Evaluate the RandomForestClassifier
clf.score(x_test,y_test)

0.8524590163934426

## Tidbit :
    *1. If you have structured data, used ensemble methods
    *2. If you have unstructure data, use deep learning or transfer learning

## 3. Fit the model/algorithm on our data and use it to make predictions

### 3.1 fitting the model to data

Different names for:
 * 'X' = features,features variables, data
 * 'Y' = labels,targets,taget variables


In [87]:
# Import the ranndomforestClassigfier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data 
X = heart_disease.drop("target",axis = 1)
Y = heart_disease["target"]


# split the data 
x_train,x_test,y_train,y_test = train_test_split(X,Y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()

#fit the model to data
clf.fit(x_train,y_train)

# Evaluate the RandomForestClassifier (use the patterns the model has learnt)
clf.score(x_test,y_test)

0.8524590163934426

In [88]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [89]:
Y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

## 3.2 Make predictions using a machine learning model

#### 2 ways to make predictions :
       * 1. predict()
       * 2. predict_proba()

In [90]:
# Use a trained model to make predictions
clf.predict(np.array([1,5,7,8,4])) # this doesnt work



ValueError: Expected 2D array, got 1D array instead:
array=[1. 5. 7. 8. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [92]:
x_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2


In [91]:
clf.predict(x_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [94]:
np.array(y_test)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [95]:
# Compare predictions to truth labels to evalute the model
y_preds = clf.predict(x_test)
np.mean(y_preds == y_test)

0.8524590163934426

In [96]:
clf.score(x_test,y_test)

0.8524590163934426

In [97]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_preds)

0.8524590163934426

### Make predictions with predict_proba()

In [98]:
# predict_proba() returns probabilites of a classification label

clf.predict_proba(X_test[:5])

array([[0.02, 0.98],
       [0.13, 0.87],
       [0.81, 0.19],
       [0.89, 0.11],
       [0.15, 0.85]])

In [99]:
clf.predict(x_test[:5])

array([0, 1, 1, 0, 1], dtype=int64)

In [105]:
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


### Predict() can also be used for regrassion models

In [107]:

from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Make the data 
X = housing_df.drop("MedHouseVal", axis = 1)
Y = housing_df["MedHouseVal"]


# split the data 
x_train,x_test,y_train,y_test = train_test_split(X,Y, test_size=0.2)

# create model instance
model = RandomForestRegressor()

#fit the model to data
model.fit(x_train,y_train)

# Make predictions
y_preds = model.predict(x_test)

In [109]:
y_preds 

array([0.49384  , 0.75494  , 4.9285964, ..., 4.8363785, 0.71782  ,
       1.67901  ])