##1. Linear Regression

*   It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s).
*   It establishes the relationship between the independent and dependent variables by fitting a best-fit line.
*   The best-fit line is known as the regression line and is represented by the linear equation Y = a * X + b, where a is the slope and b is the intercept.
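
As a rough illustration of the equation Y = a * X + b, the sketch below fits a and b by ordinary least squares for a single feature. The data points and the helper name `fit_line` are made up for illustration; the notebook itself relies on `LinearRegression` from scikit-learn.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    # the fitted line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b

# Hypothetical points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With several features, as in the sales data, each feature gets its own coefficient, which is why `model.coef_` is an array rather than a single slope.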



In [None]:
# importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train.csv')
test_data = pd.read_csv('/content/sample_data/test.csv')

In [None]:
#print first 5 rows
print(train_data.head())

   Item_Weight  ...  Outlet_Type_Supermarket Type3
0     6.800000  ...                              0
1    15.600000  ...                              0
2    12.911575  ...                              1
3    11.800000  ...                              0
4    17.850000  ...                              0

[5 rows x 36 columns]


In [None]:
# shape of the dataset (returns the number of rows and columns)
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)


Shape of training data : (1364, 36)

Shape of testing data : (341, 36)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Item_Outlet_Sales

# separate the independent and target variables on training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'],axis=1)
train_y = train_data['Item_Outlet_Sales']

In [None]:
# separate the independent and target variables on testing data
test_x = test_data.drop(columns=['Item_Outlet_Sales'],axis=1)
test_y = test_data['Item_Outlet_Sales']

In [None]:
'''
Create the object of the Linear Regression model
You can also add other parameters and test your code here
Some parameters are : fit_intercept and positive (normalize was removed in scikit-learn 1.2)
Documentation of sklearn LinearRegression: 

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

 '''
model = LinearRegression()

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

LinearRegression()

In [None]:
# coefficients of the trained model
print('\nCoefficient of model :', model.coef_)


Coefficient of model : [-3.84197604e+00  9.83065945e+00  1.61711856e+01  6.09197622e+01
 -8.64161561e+01  1.23593376e+02  2.34714039e+02 -2.44597425e+02
 -2.72938329e+01 -8.09611456e+00 -3.01147840e+02  1.70727611e+02
 -5.40194744e+01  7.34248834e+01  1.70313375e+00 -5.07701615e+01
  1.63553657e+02 -5.85286125e+01  1.04913492e+02 -6.01944874e+01
  1.98948206e+02 -1.40959023e+02  1.19426257e+02  2.66382669e+01
 -1.85619792e+02  1.43925357e+03  2.16134663e+02  3.54723990e+01
  3.54832996e+02 -5.54559635e+00 -3.49287400e+02 -1.39202954e+03
 -2.57982359e+02 -9.59016062e+02  2.60902796e+03]


In [None]:
# intercept of the model
print('\nIntercept of model',model.intercept_)


Intercept of model -121926.97473298332


In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nItem_Outlet_Sales on training data',predict_train)


Item_Outlet_Sales on training data [ 803.88817641 1733.98835979 3294.52154482 ...  811.16967914 2343.96927185
 2444.98869913]


In [None]:
# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)


RMSE on train dataset :  1135.8159344155245


In [None]:
# predict the target on the testing dataset
predict_test = model.predict(test_x)
print('\nItem_Outlet_Sales on test data',predict_test) 


Item_Outlet_Sales on test data [ 1615.37962439  3168.60806673  2564.31326686  2685.29698657
  2771.82059109  4223.3788671   2615.10827403   565.8088248
  4000.68496927  1035.54578573  2184.60316447  1033.54185437
   150.22804639  1616.19932803  2370.37858454  1953.693325
  2307.09514556  1429.85271583  2343.42149697  3780.28905363
   583.44339124  1089.08346168  2323.64661483  3559.90832258
  1829.46789667  1602.03985138   840.70282292  1823.14253132
  3145.30906529  1823.30397678  2103.35401623  3025.02597477
  2265.03907268   697.33936172  4474.05156681  2270.45195749
  1897.45212218  3305.0110824   2228.36615412  3767.90052861
  2162.33844917   665.40410258  -926.22966666   738.30407877
   197.90808777  2483.25075805  3693.05388376  2458.43116228
  1329.02544771   -57.67123156  1952.26612825  3614.4167807
  2127.22359714  2486.1932574   1826.90446272   786.7283994
  3200.67525412  1981.66000538  2326.98747373  3535.12951812
    53.4756877    129.4629475   4259.8975191   3732.152259

In [None]:
# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)


RMSE on test dataset :  1009.2517232209692
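
The RMSE values above are the square root of the mean squared difference between actual and predicted values. A minimal pure-Python sketch of that formula, using made-up numbers rather than the notebook's data (the helper name `rmse` is an assumption):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, not taken from the sales dataset
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
error = rmse(actual, predicted)
print(error)
```

Because RMSE is expressed in the same units as the target, it can be read directly as a typical prediction error in Item_Outlet_Sales.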


##2. Logistic Regression

*   It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s).
*   It predicts the probability of occurrence of an event by fitting data to a logistic (sigmoid) function.
*   It is also known as logit regression.
*   Its output values lie between 0 and 1, as expected for a probability.
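
The logistic (sigmoid) function is what keeps the output between 0 and 1. The sketch below applies it to a linear score; the coefficient and intercept values are invented for illustration and are not the ones learned later in this section.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-feature model: score = coef * x + intercept
coef, intercept = 0.8, -2.0
for x in (0.0, 2.0, 5.0):
    probability = sigmoid(coef * x + intercept)
    label = int(probability >= 0.5)   # 0.5 is the usual decision threshold
    print(x, round(probability, 3), label)
```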



In [None]:
'''
The following code is for Logistic Regression
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression #Main for LogisticRegression()
from sklearn.metrics import accuracy_score  


In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
print(train_data.head())
print(test_data.head())

   Survived        Age     Fare  ...  Embarked_C  Embarked_Q  Embarked_S
0         0  28.500000   7.2292  ...           1           0           0
1         1  27.000000  10.5000  ...           0           0           1
2         1  29.699118  16.1000  ...           0           0           1
3         0  29.699118   0.0000  ...           0           0           1
4         0  17.000000   8.6625  ...           0           0           1

[5 rows x 25 columns]
   Survived   Age      Fare  ...  Embarked_C  Embarked_Q  Embarked_S
0         0  35.0    7.1250  ...           0           0           1
1         0  20.0    7.0500  ...           0           0           1
2         0  26.0    7.8958  ...           0           0           1
3         1  58.0  146.5208  ...           1           0           0
4         1  35.0   83.4750  ...           0           0           1

[5 rows x 25 columns]


In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']


In [None]:
print(train_x)

           Age      Fare  Pclass_1  ...  Embarked_C  Embarked_Q  Embarked_S
0    28.500000    7.2292         0  ...           1           0           0
1    27.000000   10.5000         0  ...           0           0           1
2    29.699118   16.1000         0  ...           0           0           1
3    29.699118    0.0000         1  ...           0           0           1
4    17.000000    8.6625         0  ...           0           0           1
..         ...       ...       ...  ...         ...         ...         ...
707  24.000000   69.3000         1  ...           1           0           0
708  22.000000    7.2500         0  ...           0           0           1
709  29.699118  221.7792         1  ...           0           0           1
710  12.000000   11.2417         0  ...           1           0           0
711  36.000000   10.5000         0  ...           0           0           1

[712 rows x 24 columns]


In [None]:
print(train_y)

0      0
1      1
2      1
3      0
4      0
      ..
707    1
708    0
709    0
710    1
711    0
Name: Survived, Length: 712, dtype: int64


In [None]:
# separate the independent and target variables on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

In [None]:
print(test_x)

           Age      Fare  Pclass_1  ...  Embarked_C  Embarked_Q  Embarked_S
0    35.000000    7.1250         0  ...           0           0           1
1    20.000000    7.0500         0  ...           0           0           1
2    26.000000    7.8958         0  ...           0           0           1
3    58.000000  146.5208         1  ...           1           0           0
4    35.000000   83.4750         1  ...           0           0           1
..         ...       ...       ...  ...         ...         ...         ...
174  65.000000   26.5500         1  ...           0           0           1
175  19.000000   13.0000         0  ...           0           0           1
176  44.000000    8.0500         0  ...           0           0           1
177  59.000000    7.2500         0  ...           0           0           1
178  29.699118   39.6000         1  ...           1           0           0

[179 rows x 24 columns]


In [None]:
print(test_y)

0      0
1      0
2      0
3      1
4      1
      ..
174    0
175    0
176    0
177    0
178    0
Name: Survived, Length: 179, dtype: int64


In [None]:
'''
Create the object of the Logistic Regression model
You can also add other parameters and test your code here
Some parameters are : fit_intercept and penalty
Documentation of sklearn LogisticRegression: 

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

 '''
model = LogisticRegression()

In [None]:
print(model)

LogisticRegression()


In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
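
The warning above means the solver stopped before converging. Two common remedies, which the warning itself suggests, are scaling the features and raising `max_iter`. The sketch below shows both on synthetic data (generated with `make_classification`, an assumption made so the example is self-contained); the value `max_iter=1000` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train_x / train_y
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000.0   # give one feature a much larger range than the others

# Standardize the features, then fit logistic regression with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X, y)

train_accuracy = scaled_model.score(X, y)
print(train_accuracy)
```

Wrapping the scaler and the model in one pipeline means the same scaling learned on the training data is applied automatically at prediction time.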

In [None]:
print(model)

LogisticRegression()


In [None]:
# coefficients of the trained model
print('Coefficient of model :', model.coef_)

Coefficient of model : [[-0.03112606  0.00155629  0.93299841  0.08451959 -1.02556785  1.24541941
  -1.25346925  1.05047794  0.97898932  0.61562405 -1.14084292 -0.78091604
  -0.28356149 -0.4478207   0.16173065  0.6339807  -0.04705229  0.20461808
  -0.45766539 -0.33677639 -0.16688521  0.07948039  0.28573972 -0.37326995]]


In [None]:
# intercept of the model
print('Intercept of model',model.intercept_)

Intercept of model [0.07227482]


In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train) 

Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1
 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
 0 1 1 0 0 1 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1
 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0
 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 1 1 0 1 0 0 1 1 1
 0 0 1 0 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0
 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 1 1
 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 1 1 0 1
 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 0 0
 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0
 0 0

In [None]:
# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

accuracy_score on train dataset :  0.8047752808988764


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test) 

Target on test data [0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 1
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1
 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

accuracy_score on test dataset :  0.8324022346368715
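
Accuracy here is simply the fraction of predictions that match the true labels. A small sketch with invented labels (the helper name `accuracy` is an assumption; the notebook uses `accuracy_score` from scikit-learn):

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual one."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# Hypothetical labels, not the notebook's data
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]
score = accuracy(actual, predicted)
print(score)
```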


##3. Decision Tree

*   It is a type of **supervised learning algorithm** that is mostly used for ***classification problems***.
*   It works for **both categorical and continuous dependent variables**.
*   It splits the population into two or more homogeneous sets.
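
One way to measure how homogeneous a set is after a split is Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. A minimal sketch, with the helper name `gini` and the labels chosen for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure set, 0.5 for an even two-class mix."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(gini([1, 1, 1, 1]))   # pure node
print(gini([0, 0, 1, 1]))   # evenly mixed node
```

A split is considered good when the resulting child sets have lower impurity than the parent set.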




In [None]:
# importing required libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  #Main for DecisionTreeClassifier()
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
print(train_data)

     Survived        Age      Fare  ...  Embarked_C  Embarked_Q  Embarked_S
0           0  28.500000    7.2292  ...           1           0           0
1           1  27.000000   10.5000  ...           0           0           1
2           1  29.699118   16.1000  ...           0           0           1
3           0  29.699118    0.0000  ...           0           0           1
4           0  17.000000    8.6625  ...           0           0           1
..        ...        ...       ...  ...         ...         ...         ...
707         1  24.000000   69.3000  ...           1           0           0
708         0  22.000000    7.2500  ...           0           0           1
709         0  29.699118  221.7792  ...           0           0           1
710         1  12.000000   11.2417  ...           1           0           0
711         0  36.000000   10.5000  ...           0           0           1

[712 rows x 25 columns]


In [None]:
print(test_data)

     Survived        Age      Fare  ...  Embarked_C  Embarked_Q  Embarked_S
0           0  35.000000    7.1250  ...           0           0           1
1           0  20.000000    7.0500  ...           0           0           1
2           0  26.000000    7.8958  ...           0           0           1
3           1  58.000000  146.5208  ...           1           0           0
4           1  35.000000   83.4750  ...           0           0           1
..        ...        ...       ...  ...         ...         ...         ...
174         0  65.000000   26.5500  ...           0           0           1
175         0  19.000000   13.0000  ...           0           0           1
176         0  44.000000    8.0500  ...           0           0           1
177         0  59.000000    7.2500  ...           0           0           1
178         0  29.699118   39.6000  ...           1           0           0

[179 rows x 25 columns]


In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
print(train_x)

           Age      Fare  Pclass_1  ...  Embarked_C  Embarked_Q  Embarked_S
0    28.500000    7.2292         0  ...           1           0           0
1    27.000000   10.5000         0  ...           0           0           1
2    29.699118   16.1000         0  ...           0           0           1
3    29.699118    0.0000         1  ...           0           0           1
4    17.000000    8.6625         0  ...           0           0           1
..         ...       ...       ...  ...         ...         ...         ...
707  24.000000   69.3000         1  ...           1           0           0
708  22.000000    7.2500         0  ...           0           0           1
709  29.699118  221.7792         1  ...           0           0           1
710  12.000000   11.2417         0  ...           1           0           0
711  36.000000   10.5000         0  ...           0           0           1

[712 rows x 24 columns]


In [None]:
print(train_y)

0      0
1      1
2      1
3      0
4      0
      ..
707    1
708    0
709    0
710    1
711    0
Name: Survived, Length: 712, dtype: int64


In [None]:
# separate the independent and target variables on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

In [None]:
print(test_x)

           Age      Fare  Pclass_1  ...  Embarked_C  Embarked_Q  Embarked_S
0    35.000000    7.1250         0  ...           0           0           1
1    20.000000    7.0500         0  ...           0           0           1
2    26.000000    7.8958         0  ...           0           0           1
3    58.000000  146.5208         1  ...           1           0           0
4    35.000000   83.4750         1  ...           0           0           1
..         ...       ...       ...  ...         ...         ...         ...
174  65.000000   26.5500         1  ...           0           0           1
175  19.000000   13.0000         0  ...           0           0           1
176  44.000000    8.0500         0  ...           0           0           1
177  59.000000    7.2500         0  ...           0           0           1
178  29.699118   39.6000         1  ...           1           0           0

[179 rows x 24 columns]


In [None]:
print(test_y)

0      0
1      0
2      0
3      1
4      1
      ..
174    0
175    0
176    0
177    0
178    0
Name: Survived, Length: 179, dtype: int64


In [None]:
'''
Create the object of the Decision Tree model
You can also add other parameters and test your code here
Some parameters are : max_depth and max_features
Documentation of sklearn DecisionTreeClassifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

 '''
model = DecisionTreeClassifier()

In [None]:
print(model)

DecisionTreeClassifier()


In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

DecisionTreeClassifier()

In [None]:
# depth of the decision tree
print('Depth of the Decision Tree :', model.get_depth())

Depth of the Decision Tree : 19


In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train) 

Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0
 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0
 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0
 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1
 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1
 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0
 0 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
 0 0

In [None]:
# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

accuracy_score on train dataset :  0.9859550561797753


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test) 

Target on test data [0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0
 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 0
 0 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0
 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

accuracy_score on test dataset :  0.7653631284916201
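
The gap between train accuracy (about 0.99) and test accuracy (about 0.77), together with a depth of 19, suggests the unconstrained tree is overfitting. One common countermeasure is to cap the depth with `max_depth`. The sketch below uses synthetic data and an arbitrary `max_depth=3`; whether it helps on the Titanic data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train/test CSV files
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Limit how deep the tree may grow
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X_train, y_train)

print('Depth :', shallow_tree.get_depth())
print('Train accuracy :', shallow_tree.score(X_train, y_train))
print('Test accuracy :', shallow_tree.score(X_test, y_test))
```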


##4. SVM (Support Vector Machine)

*   In this algorithm, we plot each data item as a **point in n-dimensional space** (where n is the number of features), with the value of each feature being the value of a particular coordinate.



In [None]:

# importing required libraries
import pandas as pd
from sklearn.svm import SVC   #Main
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
# separate the independent and target variables on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']


In [None]:
'''
Create the object of the Support Vector Classifier model
You can also add other parameters and test your code here
Some parameters are : kernel and degree
Documentation of sklearn Support Vector Classifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

 '''
model = SVC()

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

SVC()

In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train) 

Target on train data [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0
 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0
 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
 0 0

In [None]:
# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

accuracy_score on train dataset :  0.651685393258427


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test)

Target on test data [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [None]:

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

accuracy_score on test dataset :  0.7262569832402235
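
`SVC` with its default RBF kernel is sensitive to feature scale, which may be one reason for the relatively low scores above. A hedged sketch of standardizing the features first, on synthetic data (an assumption; results on the Titanic data are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for train_x / train_y
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
X[:, 0] *= 500.0   # exaggerate the range of one feature

# Scale the features, then fit the support vector classifier
svm_model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
svm_model.fit(X, y)

predictions = svm_model.predict(X)
print(predictions[:10])
```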


##5. Naive Bayes


*   It is a classification technique based on **Bayes’ theorem** with an assumption of independence between predictors.
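*   A Naive Bayes classifier **assumes that the presence of a particular feature in a class** is **unrelated to the presence of any other feature**.

The posterior probability is given by Bayes’ theorem:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

Here,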

    P(c|x) is the posterior probability of class (target) given predictor (attribute).
    P(c) is the prior probability of class.
    P(x|c) is the likelihood which is the probability of predictor given class.
    P(x) is the prior probability of predictor.
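
As a worked example of Bayes' theorem, P(c|x) = P(x|c) * P(c) / P(x), the sketch below plugs in made-up probabilities. All three input values and the helper name `posterior` are hypothetical.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence

# Hypothetical values
p_x_given_c = 0.5   # likelihood P(x|c)
p_c = 0.4           # prior P(c)
p_x = 0.25          # prior probability of the predictor P(x)

p_c_given_x = posterior(p_x_given_c, p_c, p_x)
print(p_c_given_x)
```

Under the independence assumption, the likelihood for several features is the product of the per-feature likelihoods, which is what makes the classifier "naive".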




In [None]:
# importing required libraries
import pandas as pd
from sklearn.naive_bayes import GaussianNB #Main
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
# separate the independent and target variables on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']


In [None]:
'''
Create the object of the Naive Bayes model
You can also add other parameters and test your code here
Some parameters are : var_smoothing
Documentation of sklearn GaussianNB: 

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

 '''
model = GaussianNB()

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

GaussianNB()

In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train) 

Target on train data [1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 0 1
 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 1 1

In [None]:
# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

accuracy_score on train dataset :  0.44803370786516855


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test) 

Target on test data [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

accuracy_score on test dataset :  0.35195530726256985


##6. kNN (k-Nearest Neighbors)


*   It can be used for both classification and regression problems.
*   K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors.

Things to consider before selecting kNN:

*   kNN is computationally expensive.
*   Variables should be normalized, otherwise variables with a larger range can bias it.
*   Spend more effort on the pre-processing stage, such as outlier and noise removal, before applying kNN.
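
The majority-vote idea can be sketched in a few lines of plain Python. The points, labels, and the helper name `knn_predict` below are invented for illustration; the notebook itself uses `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points forming two groups
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.2, 1.1)))
print(knn_predict(train_points, train_labels, (8.7, 8.4)))
```

Every prediction scans all stored points, which is why kNN becomes expensive on large datasets, and the raw Euclidean distance is why unscaled features with a large range dominate the vote.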






In [None]:
# importing required libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier #Main
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)


Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

In [None]:
'''
Create the object of the K-Nearest Neighbor model
You can also add other parameters and test your code here
Some parameters are : n_neighbors, leaf_size
Documentation of sklearn K-Neighbors Classifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

 '''
model = KNeighborsClassifier()  

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

KNeighborsClassifier()

In [None]:
# Number of Neighbors used to predict the target
print('\nThe number of neighbors used to predict the target : ',model.n_neighbors)


The number of neighbors used to predict the target :  5


In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 


Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0
 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 0 0 0
 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 1 0 1 0
 0 1 1 0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0
 0 0 0 1 1 0 0 1 0 0 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0
 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0
 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1
 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0
 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 1 1 1 1
 0 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0
 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0
 0 

In [None]:
# Accuracy score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

accuracy_score on train dataset :  0.8047752808988764


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test) 

Target on test data [0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0
 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 1 0 0 1
 0 1 0 0 1 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

accuracy_score on test dataset :  0.7262569832402235


##7. K-Means

*   It is an **unsupervised algorithm** that solves the clustering problem.
*   It partitions a given data set into a chosen number of clusters (k clusters).
*   Data points within a cluster are homogeneous, and heterogeneous with respect to the other clusters.



In [None]:
# importing required libraries
import pandas as pd
from sklearn.cluster import KMeans

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to divide the training data into different clusters
# and predict which cluster a particular data point belongs to.

'''
Create the object of the K-Means model
You can also add other parameters and test your code here
Some parameters are : n_clusters and max_iter
Documentation of sklearn KMeans: 

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
 '''

model = KMeans()  

In [None]:
# fit the model with the training data
model.fit(train_data)

KMeans()

In [None]:
# Number of Clusters
print('\nDefault number of Clusters : ',model.n_clusters)


Default number of Clusters :  8


In [None]:
# predict the clusters on the train dataset
predict_train = model.predict(train_data)
print('\nClusters on train data',predict_train)


Clusters on train data [0 0 0 0 0 0 6 0 0 6 5 0 7 7 5 6 6 7 6 0 0 6 6 6 0 0 0 4 0 0 4 5 0 0 6 0 0
 0 0 0 6 0 0 4 0 5 0 0 4 5 3 0 0 0 7 4 0 0 7 0 0 3 3 7 5 4 0 5 7 6 0 0 4 0
 0 1 7 6 0 0 0 5 4 6 1 4 6 0 0 7 4 5 7 7 6 4 5 0 7 6 0 4 0 7 5 7 0 0 5 4 5
 0 7 5 0 0 6 4 0 0 7 7 0 7 0 0 4 2 0 6 0 1 6 0 4 7 6 3 0 5 0 0 6 6 0 3 1 0
 0 0 6 4 5 0 1 5 6 5 4 0 0 0 0 4 4 0 0 7 4 0 0 6 0 5 4 0 5 0 0 5 0 0 0 0 0
 0 4 6 1 5 0 0 3 0 4 0 0 0 4 5 7 0 0 6 4 7 4 5 0 0 0 5 5 0 0 0 0 0 0 0 5 0
 0 0 4 5 0 0 3 0 4 5 4 0 0 0 7 0 0 0 7 6 6 0 0 7 3 4 0 0 3 6 0 7 7 5 0 6 7
 4 5 0 4 5 5 1 4 4 5 5 5 4 5 0 6 4 4 0 0 0 0 0 6 7 0 4 7 1 6 1 7 6 0 4 7 0
 0 0 6 6 7 0 0 6 5 6 7 0 6 0 0 0 7 0 1 1 0 0 5 0 4 5 1 5 6 0 0 5 0 0 0 5 3
 0 6 5 5 6 0 4 5 6 0 5 6 0 0 0 0 0 0 7 6 0 7 5 0 7 0 0 0 0 0 0 5 0 5 0 6 0
 0 0 0 1 0 5 5 5 0 4 5 5 0 0 7 0 0 1 0 0 4 7 5 4 0 6 6 7 0 0 0 1 0 1 4 7 0
 7 2 0 0 0 6 0 0 0 0 5 0 0 6 0 7 0 6 1 6 6 4 6 0 5 0 0 1 0 4 0 0 0 0 4 0 7
 0 0 0 0 0 7 0 0 7 0 6 5 5 6 5 6 4 7 0 0 5 0 5 0 5 0 0 0 0 1 7 0 0 5 6 5 4
 

In [None]:
# predict the clusters on the test dataset
predict_test = model.predict(test_data)
print('Clusters on test data',predict_test) 

Clusters on test data [0 0 0 1 4 0 0 6 0 0 0 7 5 5 0 0 5 7 0 0 0 0 5 0 7 0 0 4 6 4 4 0 7 0 0 0 0
 3 0 0 6 5 0 0 4 5 0 0 4 4 4 5 0 4 7 1 7 0 3 5 0 4 0 0 6 0 5 0 0 5 7 6 0 6
 0 0 0 7 5 0 0 0 4 1 0 0 1 6 0 0 0 0 5 0 4 0 0 5 4 0 7 4 0 6 0 4 0 1 5 5 5
 6 4 0 0 4 6 5 0 4 1 7 4 0 0 0 4 5 0 7 5 0 0 1 5 0 0 5 3 6 0 0 5 6 0 0 0 6
 0 6 0 6 4 6 0 0 0 5 0 0 0 0 5 0 0 5 4 0 0 0 0 7 6 0 6 0 6 6 5]


In [None]:
# Now, we will train a model with n_clusters = 3
model_n3 = KMeans(n_clusters=3)

In [None]:
# fit the model with the training data
model_n3.fit(train_data)

KMeans(n_clusters=3)

In [None]:
# Number of Clusters
print('\nNumber of Clusters : ',model_n3.n_clusters)


Number of Clusters :  3


In [None]:
# predict the clusters on the train dataset
predict_train_3 = model_n3.predict(train_data)
print('\nClusters on train data',predict_train_3)


Clusters on train data [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0
 0 0 0 0 0 0 2 0 0 0 0 2 0 1 0 0 0 0 2 0 0 0 0 0 1 1 0 0 2 0 0 0 0 0 0 2 0
 0 2 0 0 0 0 0 0 2 0 2 2 0 0 0 0 2 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0
 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 1 0 0 0 2 0 0 2 0 0 1 0 0 0 0 0 0 0 1 2 0
 0 0 0 2 0 0 2 0 0 0 2 0 0 0 0 2 2 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
 0 2 0 2 0 0 0 1 0 2 0 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 2 0 0 0 1 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 1 0 0 0 0 0 0 0 0
 2 0 0 2 0 0 2 2 2 0 0 0 2 0 0 0 2 2 0 0 0 0 0 0 0 0 2 0 2 0 2 0 0 0 2 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 2 0 0 2 0 0 0 0 0 0 0 2 0 2 2 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 2 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 2
 

In [None]:
# predict the clusters on the test dataset
predict_test_3 = model_n3.predict(test_data)
print('Clusters on test data',predict_test_3) 

Clusters on test data [0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 0 0 0 0 0 0
 1 0 0 0 0 0 0 2 0 0 0 2 2 2 0 0 2 0 2 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 2 2 0 0 2 0 0 0 0 0 0 0 2 0 0 0 2 0 0 2 0 0 0 2 0 2 0 0 0
 0 2 0 0 2 0 0 0 2 2 0 2 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0]
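
The two runs above use 8 and 3 clusters without a criterion for choosing between them. One common heuristic is to compare the within-cluster sum of squares, which scikit-learn exposes as `inertia_`, across several values of `n_clusters`. The following is a minimal sketch, assuming synthetic data from `make_blobs` rather than the CSV files used above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 3 generated groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    # inertia_ is the sum of squared distances of samples to their closest center
    inertias[k] = km.inertia_
    print('n_clusters =', k, '| inertia :', km.inertia_)
```

Inertia keeps shrinking as clusters are added, so the value to look for is the point where the improvement levels off rather than the minimum.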


##8. Random Forest

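*   Random Forest is a trademarked term for an ensemble of decision trees.
*   A Random Forest is a collection of decision trees (hence the name "forest"). To classify a new object, each tree gives its own classification and the forest combines the outputs of all the trees.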
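
To make the idea of combining trees concrete, here is a minimal sketch on synthetic data (an assumption; the data and the choice of 25 trees are for illustration only). In scikit-learn the fitted trees are stored in `estimators_`, and the forest's class probabilities are the average of the per-tree probabilities rather than a count of hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the dataset used in this notebook)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Class probabilities predicted by each individual tree
tree_probas = np.array([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities
averaged = tree_probas.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X))

print('Number of trees :', len(forest.estimators_))
print('Average of trees matches forest.predict_proba :', matches)
```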

In [None]:
# importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  #Main
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

In [None]:
'''

Create the object of the Random Forest model
You can also add other parameters and test your code here
Some parameters are : n_estimators and max_depth
Documentation of sklearn RandomForestClassifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

'''
model = RandomForestClassifier()

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

RandomForestClassifier()

In [None]:
# number of trees used
print('Number of Trees used : ', model.n_estimators)

Number of Trees used :  100


In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 


Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0
 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0
 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0
 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0
 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1
 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1
 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0
 0 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
 0 

In [None]:
# Accuracy score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


accuracy_score on train dataset :  0.9859550561797753


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test) 


Target on test data [0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 1 1 0
 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1 1 0 0 0 0
 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)


accuracy_score on test dataset :  0.8044692737430168


##9. Dimensionality Reduction Algorithms

In [None]:
# importing required libraries
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train.csv')
test_data = pd.read_csv('/content/sample_data/test.csv')

In [None]:
# view the top 3 rows of the dataset
print(train_data.head(3))

   Item_Weight  ...  Outlet_Type_Supermarket Type3
0     6.800000  ...                              0
1    15.600000  ...                              0
2    12.911575  ...                              1

[3 rows x 36 columns]


In [None]:
# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)


Shape of training data : (1364, 36)

Shape of testing data : (341, 36)


In [None]:
# Now, we need to predict the target variable in the test data
# target variable - Item_Outlet_Sales

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'],axis=1)
train_y = train_data['Item_Outlet_Sales']


In [None]:
# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Item_Outlet_Sales'],axis=1)
test_y = test_data['Item_Outlet_Sales']

In [None]:
print('\nTraining model with {} dimensions.'.format(train_x.shape[1]))


Training model with 35 dimensions.


In [None]:
# create object of model
model = LinearRegression()

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

LinearRegression()

In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)

In [None]:
# RMSE (root mean squared error) on train dataset
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)


RMSE on train dataset :  1135.8159344155245
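
The expression `mean_squared_error(...)**(0.5)` is the root mean squared error: the square root of the average squared difference between the true and predicted values. A small worked example, using made-up numbers purely for illustration, shows that the manual formula and the scikit-learn call agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4 (mean 5/3)
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print('RMSE (manual)  :', rmse_manual)
print('RMSE (sklearn) :', rmse_sklearn)
```

Because RMSE is in the same units as the target, the values reported in this section can be read directly as a typical error in `Item_Outlet_Sales`.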


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)

In [None]:
# RMSE on test dataset
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)


RMSE on test dataset :  1009.2517232209692


In [None]:
# create the object of the PCA (Principal Component Analysis) model
# reduce the dimensions of the data to 12
'''
You can also add other parameters and test your code here
Some parameters are : svd_solver, iterated_power
Documentation of sklearn PCA:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
'''
model_pca = PCA(n_components=12)

# fit PCA on the training data only, then apply the same projection to the test data
new_train = model_pca.fit_transform(train_x)
new_test  = model_pca.transform(test_x)

In [None]:

print('\nTraining model with {} dimensions.'.format(new_train.shape[1]))


Training model with 12 dimensions.
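
A PCA projection is normally learned from the training data and then reused unchanged on the test data, so that both sets live in the same reduced space. The following is a minimal sketch on random data (an assumption; the shapes and `n_components=3` are arbitrary) that also prints how much variance the kept components explain, which can help when choosing the number of dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data (not the dataset used in this notebook)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_test = rng.normal(size=(20, 10))

pca = PCA(n_components=3)
train_reduced = pca.fit_transform(X_train)  # learn the components on train
test_reduced = pca.transform(X_test)        # reuse the same components on test

variance_kept = pca.explained_variance_ratio_.sum()
print('Reduced shapes :', train_reduced.shape, test_reduced.shape)
print('Share of variance kept :', variance_kept)
```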


In [None]:
# create object of model
model_new = LinearRegression()

In [None]:
# fit the model with the training data
model_new.fit(new_train,train_y)

LinearRegression()

In [None]:
# predict the target on the new train dataset
predict_train_pca = model_new.predict(new_train)

In [None]:
# RMSE on the new (PCA-reduced) train dataset
rmse_train_pca = mean_squared_error(train_y,predict_train_pca)**(0.5)
print('\nRMSE on new train dataset : ', rmse_train_pca)


RMSE on new train dataset :  1159.998162929229


In [None]:
# predict the target on the new test dataset
predict_test_pca = model_new.predict(new_test)

In [None]:
# RMSE on the new (PCA-reduced) test dataset
rmse_test_pca = mean_squared_error(test_y,predict_test_pca)**(0.5)
print('\nRMSE on new test dataset : ', rmse_test_pca)


RMSE on new test dataset :  1014.4104005078585


##10. Gradient Boosting Algorithms
###10.1. GBM

In [None]:
# importing required libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier #Main
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

In [None]:
'''
Create the object of the GradientBoosting Classifier model
You can also add other parameters and test your code here
Some parameters are : learning_rate, n_estimators
Documentation of sklearn GradientBoosting Classifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
'''
model = GradientBoostingClassifier(n_estimators=100,max_depth=5)

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

GradientBoostingClassifier(max_depth=5)

In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 


Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0
 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 1 0 0 0
 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0
 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0
 0 1 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1
 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1
 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0
 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
 0 

In [None]:
# Accuracy score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


accuracy_score on train dataset :  0.9592696629213483


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test) 


Target on test data [0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0
 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)


accuracy_score on test dataset :  0.8268156424581006
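
The gap between train and test accuracy above suggests looking at how accuracy evolves as trees are added. scikit-learn's `GradientBoostingClassifier` provides `staged_predict`, which yields the predictions after each boosting stage. The following is a minimal sketch, assuming synthetic data and arbitrary settings (`n_estimators=50`, `max_depth=3`) rather than the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the dataset used in this notebook)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

gbm = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbm.fit(X_train, y_train)

# Test accuracy after 1, 2, ..., n_estimators boosting stages
staged_test_acc = [
    accuracy_score(y_test, pred) for pred in gbm.staged_predict(X_test)
]
best_stage = staged_test_acc.index(max(staged_test_acc)) + 1

print('Stages evaluated :', len(staged_test_acc))
print('Best test accuracy', max(staged_test_acc), 'at stage', best_stage)
```

If the best stage comes well before the last one, a smaller `n_estimators` or a lower `learning_rate` may be worth trying.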


###10.2. XGBoost

In [None]:
# importing required libraries
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

In [None]:
'''
Create the object of the XGBoost model
You can also add other parameters and test your code here
Some parameters are : max_depth and n_estimators
Documentation of xgboost:

https://xgboost.readthedocs.io/en/latest/
'''
model = XGBClassifier()


In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

XGBClassifier()

In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 


Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
 1 0 0 0 1 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 1 0 0 1
 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0
 0 0 1 1 1 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0
 0 1 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1
 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 1 1 1
 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0
 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
 0 

In [None]:
# Accuracy score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


accuracy_score on train dataset :  0.8693820224719101


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test) 


Target on test data [0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)


accuracy_score on test dataset :  0.8212290502793296


###10.3. LightGBM

In [None]:
import pandas as pd
import lightgbm as lgb #Main
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data.csv')
test_data = pd.read_csv('/content/sample_data/test-data.csv')

In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (712, 25)
Shape of testing data : (179, 25)


In [None]:
# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

In [None]:
# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

In [None]:
'''
Create the object of the LightGBM Classifier model
You can also add other parameters and test your code here
Some parameters are : n_estimators, boosting_type
Documentation of LightGBM: 

https://lightgbm.readthedocs.io/en/latest/index.html
'''
model = lgb.LGBMClassifier()

In [None]:
# fit the model with the training data
model.fit(train_x,train_y)

LGBMClassifier()

In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 


Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0
 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 1 0 0 0
 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 0
 0 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0
 0 1 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1
 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1
 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0
 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
 0 

In [None]:
# Accuracy score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


accuracy_score on train dataset :  0.949438202247191


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test) 


Target on test data [0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 1 1 1 1
 0 1 0 0 0 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)


accuracy_score on test dataset :  0.8547486033519553


###10.4. CatBoost

In [None]:
pip install catboost

Collecting catboost
  Downloading catboost-1.0.3-cp37-none-manylinux1_x86_64.whl (76.3 MB)
     |████████████████████████████████| 76.3 MB 1.3 MB/s
Installing collected packages: catboost
Successfully installed catboost-1.0.3


In [None]:
# importing required libraries
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier #Main
from sklearn.metrics import accuracy_score

In [None]:
# read the train and test dataset
train_data = pd.read_csv('/content/sample_data/train-data2.csv')
test_data = pd.read_csv('/content/sample_data/test-data2.csv')

In [None]:
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

Shape of training data : (1176, 35)
Shape of testing data : (294, 35)


In [None]:
# Now, we have used a dataset which has more categorical variables
# hr-employee attrition data where target variable is Attrition 

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Attrition'],axis=1)
train_y = train_data['Attrition']

In [None]:
# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Attrition'],axis=1)
test_y = test_data['Attrition']

In [None]:
# find the indices of all non-float columns; these are passed to CatBoost as categorical
categorical_var = np.where(train_x.dtypes != float)[0]
print('\nCategorical Variables indices : ',categorical_var)


Categorical Variables indices :  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33]
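
The check above marks every column that is not floating point as categorical, which is why all 34 column indices are listed, integer-valued columns included. Whether integer columns should be treated as categories depends on the data. The following is a minimal sketch, using a hypothetical toy frame whose column names are illustrative only, that contrasts the "not float" rule with a stricter "not numeric" rule.

```python
import pandas as pd

# Hypothetical toy frame; the column names and values are made up
df = pd.DataFrame({
    'Age': [34, 45, 29],
    'Department': ['Sales', 'R&D', 'Sales'],
    'MonthlyRate': [1200.5, 980.0, 1500.25],
    'OverTime': ['Yes', 'No', 'No'],
})

# Rule 1: everything that is not floating point (integers and text)
non_float_idx = [
    i for i, col in enumerate(df.columns)
    if not pd.api.types.is_float_dtype(df[col])
]

# Rule 2: only columns that are not numeric at all (text)
non_numeric_idx = [
    i for i, col in enumerate(df.columns)
    if not pd.api.types.is_numeric_dtype(df[col])
]

print('Non-float column indices   :', non_float_idx)
print('Non-numeric column indices :', non_numeric_idx)
```

Either list could be passed as `cat_features`; the first treats integer codes as categories, while the second leaves them as ordinary numeric features.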


In [None]:
print('\n Training CatBoost Model..........')
'''
Create the object of the CatBoost Classifier model
You can also add other parameters and test your code here
Some parameters are : l2_leaf_reg, model_size_reg
Documentation of CatBoostClassifier: 

https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html
'''
model = CatBoostClassifier(iterations=50)


 Training CatBoost Model..........


In [None]:
# fit the model with the training data
model.fit(train_x,train_y,cat_features = categorical_var,plot=False)
print('\n Model Trained')

Learning rate set to 0.172204
0:	learn: 0.6028609	total: 55.5ms	remaining: 2.72s
1:	learn: 0.5428621	total: 59.7ms	remaining: 1.43s
2:	learn: 0.5018961	total: 66ms	remaining: 1.03s
3:	learn: 0.4562355	total: 77.6ms	remaining: 892ms
4:	learn: 0.4337903	total: 85ms	remaining: 765ms
5:	learn: 0.4117519	total: 92.4ms	remaining: 677ms
6:	learn: 0.3941772	total: 99.5ms	remaining: 611ms
7:	learn: 0.3765734	total: 106ms	remaining: 558ms
8:	learn: 0.3663712	total: 112ms	remaining: 509ms
9:	learn: 0.3517369	total: 117ms	remaining: 469ms
10:	learn: 0.3454326	total: 123ms	remaining: 436ms
11:	learn: 0.3433598	total: 125ms	remaining: 396ms
12:	learn: 0.3334597	total: 132ms	remaining: 376ms
13:	learn: 0.3236337	total: 138ms	remaining: 355ms
14:	learn: 0.3139081	total: 144ms	remaining: 336ms
15:	learn: 0.3045607	total: 150ms	remaining: 318ms
16:	learn: 0.2955388	total: 156ms	remaining: 303ms
17:	learn: 0.2869364	total: 162ms	remaining: 288ms
18:	learn: 0.2833813	total: 168ms	remaining: 274ms
19:	lear

In [None]:
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 


Target on train data [0 0 0 ... 0 0 0]


In [None]:
# Accuracy score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


accuracy_score on train dataset :  0.9141156462585034


In [None]:
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test) 


Target on test data [0 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]


In [None]:
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)


accuracy_score on test dataset :  0.8605442176870748


Reference:
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/?#h2_16