# Getting the data ready 
#### Involves 3 steps
1. Split the data into features (`x`) and labels (`y`), and then into training and testing sets.
2. Fill (impute) or drop the missing values.
3. Convert the non-numerical values (objects) into numerical values (int64, float64).

In [1]:
# step 0: import the libraries
import pandas as pd
import sklearn 
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
print("Imported!")

Imported!


In [2]:
# step 1: Get the data 
file = pd.read_csv("../datasets/heart.csv")
file.head(15)

Unnamed: 0,age,sex,chest_pain_type,resting_bp,cholestoral,fasting_blood_sugar,restecg,max_hr,exang,oldpeak,slope,num_major_vessels,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


### Splitting the data between features and labels in variables x and y

In [3]:
# x should contain everything except the label column, so drop that column from the table.

x = file.drop("target", axis=1)  # axis=1 means "drop a column" (axis=0 would drop a row)

#  print the features

print(x.info())

x

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  303 non-null    int64  
 1   sex                  303 non-null    int64  
 2   chest_pain_type      303 non-null    int64  
 3   resting_bp           303 non-null    int64  
 4   cholestoral          303 non-null    int64  
 5   fasting_blood_sugar  303 non-null    int64  
 6   restecg              303 non-null    int64  
 7   max_hr               303 non-null    int64  
 8   exang                303 non-null    int64  
 9   oldpeak              303 non-null    float64
 10  slope                303 non-null    int64  
 11  num_major_vessels    303 non-null    int64  
 12  thal                 303 non-null    int64  
dtypes: float64(1), int64(12)
memory usage: 30.9 KB
None


Unnamed: 0,age,sex,chest_pain_type,resting_bp,cholestoral,fasting_blood_sugar,restecg,max_hr,exang,oldpeak,slope,num_major_vessels,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [4]:
y = file["target"]
print(y.info())
y

<class 'pandas.core.series.Series'>
RangeIndex: 303 entries, 0 to 302
Series name: target
Non-Null Count  Dtype
--------------  -----
303 non-null    int64
dtypes: int64(1)
memory usage: 2.5 KB
None


0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

### Now Split the data using sklearn
So far we have only separated the features (`x`) from the labels (`y`). The model also needs separate training and testing rows, so that it can be scored on data it has not seen during training. The `train_test_split` function from the sklearn library does this.

In [5]:
from sklearn.model_selection import train_test_split
#train_test_split -> splits the data into 4 variables
x_train, x_test, y_train, y_test =  train_test_split(x,y,test_size=0.2)

# x = feature data
# y = label data
# test_size = fraction of the rows held out for testing (0.2 = 20% of the data)


In [6]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((242, 13), (61, 13), (242,), (61,))
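
The shapes above follow from `test_size=0.2` applied to 303 rows. Here is a minimal sketch of the arithmetic, assuming (as far as I understand scikit-learn's behaviour) that the test count is rounded up and the rest goes to training. `split_sizes` is a hypothetical helper written for this note, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: row counts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumption: the test count is rounded up
    n_train = n_samples - n_test               # everything else is used for training
    return n_train, n_test

print(split_sizes(303, 0.2))  # heart data has 303 rows -> (242, 61)
```

The result matches the 242 training rows and 61 testing rows reported by `.shape`.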

In [7]:
x_train

Unnamed: 0,age,sex,chest_pain_type,resting_bp,cholestoral,fasting_blood_sugar,restecg,max_hr,exang,oldpeak,slope,num_major_vessels,thal
123,54,0,2,108,267,0,0,167,0,0.0,2,0,2
207,60,0,0,150,258,0,0,157,0,2.6,1,2,3
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
153,66,0,2,146,278,0,0,152,0,0.0,1,1,2
184,50,1,0,150,243,0,0,128,0,2.6,1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
183,58,1,2,112,230,0,0,165,0,2.5,1,1,3
25,71,0,1,160,302,0,1,162,0,0.4,2,2,2
32,44,1,1,130,219,0,0,188,0,0.0,2,0,2
92,52,1,2,138,223,0,1,169,0,0.0,2,4,2


In [8]:
y_train

123    1
207    0
1      1
153    1
184    0
      ..
183    0
25     1
32     1
92     1
289    0
Name: target, Length: 242, dtype: int64

In [9]:
x_test

Unnamed: 0,age,sex,chest_pain_type,resting_bp,cholestoral,fasting_blood_sugar,restecg,max_hr,exang,oldpeak,slope,num_major_vessels,thal
275,52,1,0,125,212,0,1,168,0,1.0,2,2,3
144,76,0,2,140,197,0,2,116,0,1.1,1,0,2
161,55,0,1,132,342,0,1,166,0,1.2,2,0,2
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
120,64,0,0,130,303,0,1,122,0,2.0,1,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,61,1,0,138,166,0,0,125,1,3.6,1,1,2
285,46,1,0,140,311,0,1,120,1,1.8,1,2,3
55,52,1,1,134,201,0,1,158,0,0.8,2,1,2
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1


In [10]:
y_test

275    0
144    1
161    1
300    0
120    1
      ..
279    0
285    0
55     1
5      1
37     1
Name: target, Length: 61, dtype: int64

# 2.1 Cleaning the data

Getting the data ready has 3 parts.

So far we have split the data vertically into two parts (features and labels) and then horizontally into two more (training and testing). That gives four pieces: training features, testing features, training labels, and testing labels. But real data is rarely ready to be split into these 4 quadrants straight away, so let's go back 1 step and look at the data we actually have.

The data we worked on had completely filled rows, only numerical values, and only useful columns. Sometimes the data does not have these 3 qualities (filled rows, all numbers, useful columns).

So getting the data ready involves these 3 steps before splitting it.

1. **Cleaning the data:** Remove the empty cells, either by dropping the affected rows or by filling the gaps. Making the data complete (no gaps) is very important.
2. **Transforming the data:** Convert strings and objects to numerical data. Example: the colour "red" can be converted to the RGB value (1, 0, 0).
3. **Reducing the data:** Remove columns that carry no information, such as an "are you human?" column where every row says yes. Reducing the data also makes loading and working with it cheaper, since large data costs a lot of CPU time.
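
The three steps above can be sketched on a tiny table. This is only an illustration: the `toy` data and its column names are made up, and the choices (mean fill, `pd.get_dummies`, dropping single-valued columns) are one reasonable option among several.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the car sales file)
toy = pd.DataFrame({
    "colour": ["red", "green", None, "red"],
    "doors": [4.0, None, 2.0, 4.0],
    "is_car": ["yes", "yes", "yes", "yes"],  # same value in every row
})

# 1. Cleaning: fill the numeric gap with the column mean,
#    then drop any row that still has a missing value
toy["doors"] = toy["doors"].fillna(toy["doors"].mean())
toy = toy.dropna()

# 2. Transforming: one-hot encode the text column "colour"
toy = pd.get_dummies(toy, columns=["colour"])

# 3. Reducing: drop columns that hold a single value
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
toy = toy.drop(columns=constant_cols)

print(toy)
```

After these steps the table has no gaps, no text columns, and no constant column, so it is in the shape a model expects.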

In [11]:
file = pd.read_csv("../datasets/Car_sales_missing.csv")
file = file.drop("Latest_Launch",axis = 1)
file

Unnamed: 0,Manufacturer,Sales_in_thousands,__year_resale_value,Vehicle_type,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Power_perf_factor
0,Acura,16.919,16.360,Passenger,21.50,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,58.280150
1,Acura,39.384,19.875,Passenger,28.40,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,91.370778
2,Acura,14.114,18.225,Passenger,,3.2,225.0,106.9,70.6,192.0,3.470,17.2,26.0,
3,Acura,8.588,29.725,Passenger,42.00,3.5,210.0,114.6,71.4,196.6,3.850,18.0,22.0,91.389779
4,Audi,20.397,22.255,Passenger,23.99,1.8,,102.6,68.2,178.0,2.998,16.4,27.0,62.777639
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,Volvo,3.545,,Passenger,24.40,1.9,160.0,100.5,67.6,176.6,3.042,15.8,25.0,66.498812
153,Volvo,15.245,,Passenger,27.50,,168.0,104.9,69.3,185.9,3.208,17.9,25.0,70.654495
154,Volvo,17.531,,Passenger,28.80,2.4,168.0,104.9,69.3,186.2,3.259,17.9,25.0,71.155978
155,Volvo,3.493,,Passenger,45.50,2.3,236.0,104.9,71.5,185.7,3.601,18.5,23.0,


In [12]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Manufacturer         157 non-null    object 
 1   Sales_in_thousands   155 non-null    float64
 2   __year_resale_value  117 non-null    float64
 3   Vehicle_type         156 non-null    object 
 4   Price_in_thousands   150 non-null    float64
 5   Engine_size          153 non-null    float64
 6   Horsepower           153 non-null    float64
 7   Wheelbase            153 non-null    float64
 8   Width                154 non-null    float64
 9   Length               154 non-null    float64
 10  Curb_weight          152 non-null    float64
 11  Fuel_capacity        155 non-null    float64
 12  Fuel_efficiency      154 non-null    float64
 13  Power_perf_factor    144 non-null    float64
dtypes: float64(12), object(2)
memory usage: 17.3+ KB


In [13]:
# first step: drop every row that contains a null value (for now...)
file = file.dropna()

In [14]:
file.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90 entries, 0 to 147
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Manufacturer         90 non-null     object 
 1   Sales_in_thousands   90 non-null     float64
 2   __year_resale_value  90 non-null     float64
 3   Vehicle_type         90 non-null     object 
 4   Price_in_thousands   90 non-null     float64
 5   Engine_size          90 non-null     float64
 6   Horsepower           90 non-null     float64
 7   Wheelbase            90 non-null     float64
 8   Width                90 non-null     float64
 9   Length               90 non-null     float64
 10  Curb_weight          90 non-null     float64
 11  Fuel_capacity        90 non-null     float64
 12  Fuel_efficiency      90 non-null     float64
 13  Power_perf_factor    90 non-null     float64
dtypes: float64(12), object(2)
memory usage: 10.5+ KB
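
Dropping rows is simple but costly: the table went from 157 rows to 90. An alternative is to fill (impute) the gaps. Here is a minimal sketch on made-up values. The column names are borrowed from the car data, but the numbers are invented, and using the mean for numeric columns and the most frequent value for text columns is just one common choice.

```python
import pandas as pd

# Made-up rows; only the column names come from the car sales data
cars = pd.DataFrame({
    "Manufacturer": ["Acura", "Acura", "Audi", "Volvo"],
    "Vehicle_type": ["Passenger", None, "Passenger", "Car"],
    "Horsepower": [140.0, 225.0, None, 160.0],
})

filled = cars.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # numeric column: fill gaps with the column mean
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        # text column: fill gaps with the most frequent value
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(len(cars.dropna()), "rows survive dropna()")
print(len(filled), "rows survive imputation")
```

Rows with a missing label (here that would be the price) are usually still dropped, because a filled-in label is not a real observation.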


In [15]:
# Start splitting the data
# vertical split -> feature and label

x = file.drop("Price_in_thousands",axis=1)
y = file["Price_in_thousands"]

In [16]:
#horizontal split

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

In [17]:
x

Unnamed: 0,Manufacturer,Sales_in_thousands,__year_resale_value,Vehicle_type,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Power_perf_factor
0,Acura,16.919,16.360,Passenger,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,58.280150
1,Acura,39.384,19.875,Passenger,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,91.370778
3,Acura,8.588,29.725,Passenger,3.5,210.0,114.6,71.4,196.6,3.850,18.0,22.0,91.389779
5,Audi,18.780,23.555,Passenger,2.8,200.0,108.7,76.1,192.0,3.561,18.5,22.0,84.565105
6,Audi,1.380,39.000,Passenger,4.2,310.0,113.0,74.0,198.2,3.902,23.7,21.0,134.656858
...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,Toyota,68.411,19.425,Car,2.7,150.0,105.3,66.5,183.3,3.440,18.5,23.0,62.355577
144,Toyota,9.835,34.080,Car,4.7,230.0,112.2,76.4,192.5,5.115,25.4,15.0,102.528984
145,Volkswagen,9.761,11.425,Passenger,2.0,115.0,98.9,68.3,163.3,2.767,14.5,26.0,46.943877
146,Volkswagen,83.721,13.240,Passenger,2.0,115.0,98.9,68.3,172.3,2.853,14.5,26.0,47.638237


In [18]:
y

0      21.500
1      28.400
3      42.000
5      33.950
6      62.000
        ...  
143    22.288
144    51.728
145    14.900
146    16.700
147    21.200
Name: Price_in_thousands, Length: 90, dtype: float64

In [19]:
x_train

Unnamed: 0,Manufacturer,Sales_in_thousands,__year_resale_value,Vehicle_type,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Power_perf_factor
48,Ford,35.068,8.835,Passenger,2.5,170.0,106.5,69.1,184.6,2.769,15.0,25.0,67.351011
112,Oldsmobile,20.017,19.925,Car,4.3,190.0,107.0,67.8,181.2,4.068,17.5,19.0,80.511673
24,Chevrolet,17.947,36.225,Passenger,5.7,345.0,104.5,73.6,179.7,3.210,19.1,22.0,141.141150
1,Acura,39.384,19.875,Passenger,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,91.370778
6,Audi,1.380,39.000,Passenger,4.2,310.0,113.0,74.0,198.2,3.902,23.7,21.0,134.656858
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,Chevrolet,32.299,9.125,Passenger,1.8,120.0,97.1,66.7,174.3,2.398,13.2,33.0,48.297636
41,Dodge,16.767,15.510,Car,3.9,175.0,109.6,78.8,192.6,4.245,32.0,15.0,71.135292
131,Saturn,5.223,10.790,Passenger,1.9,124.0,102.4,66.4,176.9,2.452,12.1,31.0,49.865774
94,Mercedes-B,16.774,50.375,Passenger,4.3,275.0,121.5,73.1,203.1,4.133,23.2,21.0,125.273876


In [20]:
x_test

Unnamed: 0,Manufacturer,Sales_in_thousands,__year_resale_value,Vehicle_type,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Power_perf_factor
136,Toyota,142.535,10.025,Passenger,1.8,120.0,97.0,66.7,174.0,2.42,13.2,33.0,47.968972
113,Oldsmobile,24.361,15.24,Car,3.4,185.0,120.0,72.2,201.4,3.948,25.0,22.0,76.09657
37,Dodge,71.186,10.185,Passenger,2.5,168.0,108.0,71.0,186.0,3.058,16.0,24.0,67.876108
62,Hyundai,41.184,5.86,Passenger,1.5,92.0,96.1,65.7,166.7,2.24,11.9,31.0,36.672284
55,Ford,220.65,7.85,Car,2.5,119.0,117.5,69.4,200.7,3.086,20.0,23.0,47.389531
3,Acura,8.588,29.725,Passenger,3.5,210.0,114.6,71.4,196.6,3.85,18.0,22.0,91.389779
22,Chevrolet,42.593,11.525,Passenger,3.4,180.0,110.5,72.7,197.9,3.34,17.0,27.0,72.030917
45,Dodge,181.749,12.025,Car,2.4,150.0,113.3,76.8,186.3,3.533,20.0,24.0,61.227
101,Nissan,42.643,8.45,Passenger,1.8,126.0,99.8,67.3,177.5,2.593,13.2,30.0,50.241978
51,Ford,63.403,14.21,Passenger,4.6,200.0,114.7,78.2,212.0,3.908,19.0,21.0,80.499537


In [21]:
y_train

48     17.035
112    31.598
24     45.705
1      28.400
6      62.000
        ...  
25     13.960
41     21.315
131    14.290
94     69.700
81     17.357
Name: Price_in_thousands, Length: 72, dtype: float64

In [22]:
y_test

136    13.108
113    25.345
37     20.230
62      9.699
55     12.050
3      42.000
22     19.390
45     19.565
101    13.499
51     22.195
88     19.035
95     82.600
47     21.560
61     26.000
67     14.460
137    17.518
82     24.997
73     54.005
Name: Price_in_thousands, dtype: float64

In [1]:
# The target (price) is a continuous number, so we need a regression model;
# here we use a random forest regressor.

from sklearn.ensemble import RandomForestRegressor


model = RandomForestRegressor()  # store the model in a variable
# model.fit(x_train, y_train)    # train the model on features and labels
# model.score(x_test, y_test)    # test the model on the testing data

# The two calls above are commented out because fit() raises an error:
# the strings in x_train cannot be converted to float values.
# Now what?
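
The failure described in the comments comes down to text that cannot be parsed as a number. Here is a small sketch in plain Python, independent of scikit-learn, that shows the same kind of error on one made-up row:

```python
row = ["Acura", "Passenger", 1.8, 140.0]  # made-up row mixing text and numbers

converted = []
for value in row:
    try:
        converted.append(float(value))   # works for 1.8 and 140.0
    except ValueError as err:
        print("cannot convert:", err)    # raised for "Acura" and "Passenger"

print(converted)
```

Only the numeric entries survive, which is why the text columns have to be encoded as numbers before fitting.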



# Converting the string and other objects to Numerical Data.

Here is a step-by-step explanation of what the code does:

1. Imports the necessary classes: `OneHotEncoder` from `sklearn.preprocessing` and `ColumnTransformer` from `sklearn.compose`.
2. Defines a list of categorical features: `["Manufacturer", "Vehicle_type"]`.
3. Creates a `OneHotEncoder` object: `one_hot`.
4. Creates a `ColumnTransformer` object: `transformer`. The `ColumnTransformer` takes a list of tuples as input. Each tuple specifies how a particular set of columns in the dataset should be transformed. In this case, the only tuple specifies that the `Manufacturer` and `Vehicle_type` columns should be one-hot encoded using the `one_hot` object. The `remainder` keyword argument specifies that all other columns in the dataset should be passed through without any transformation.
5. Fits the `ColumnTransformer` to the data and transforms it: `transformer.fit_transform(x)`. This creates a new dataset where the `Manufacturer` and `Vehicle_type` columns have been one-hot encoded.
6. Stores the transformed dataset in the variable `transfromed_x` (spelled this way in the code).
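
To make "one-hot encoded" concrete, here is a hand-written sketch in plain Python. It is not scikit-learn's implementation. The rows are made up, and `encode` is a hypothetical helper that only mimics the idea: one 0/1 column per category, followed by the untouched ("passthrough") columns.

```python
rows = [
    {"Manufacturer": "Acura", "Vehicle_type": "Passenger", "Horsepower": 140.0},
    {"Manufacturer": "Audi", "Vehicle_type": "Passenger", "Horsepower": 200.0},
    {"Manufacturer": "Toyota", "Vehicle_type": "Car", "Horsepower": 150.0},
]
categorical = ["Manufacturer", "Vehicle_type"]

# Sorted unique values of each categorical column
categories = {col: sorted({r[col] for r in rows}) for col in categorical}

def encode(row):
    """Hypothetical helper: one-hot columns first, then the remaining values."""
    out = []
    for col in categorical:
        out.extend(1.0 if row[col] == cat else 0.0 for cat in categories[col])
    # "passthrough": append the non-categorical values unchanged
    out.extend(v for k, v in row.items() if k not in categorical)
    return out

encoded = [encode(r) for r in rows]
for line in encoded:
    print(line)
```

Each row grows from 3 values to 6: three manufacturer indicators, two vehicle-type indicators, and the original horsepower.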

In [24]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Manufacturer","Vehicle_type"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],
                               remainder = "passthrough")

transfromed_x = transformer.fit_transform(x)
transfromed_x


array([[ 1.        ,  0.        ,  0.        , ..., 13.2       ,
        28.        , 58.28014952],
       [ 1.        ,  0.        ,  0.        , ..., 17.2       ,
        25.        , 91.37077766],
       [ 1.        ,  0.        ,  0.        , ..., 18.        ,
        22.        , 91.38977933],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., 14.5       ,
        26.        , 46.94387676],
       [ 0.        ,  0.        ,  0.        , ..., 14.5       ,
        26.        , 47.63823666],
       [ 0.        ,  0.        ,  0.        , ..., 16.4       ,
        27.        , 61.70138136]])

In [25]:
file = pd.DataFrame(transfromed_x)
file.to_csv("../datasets/transformed_cars.csv",index=False)
print("Done")

Done


In [26]:
#get dummy function was applied here 
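
The cell above mentions the "get dummy" function but shows no code. Here is a minimal sketch of what a `pd.get_dummies` call could look like, on made-up rows (the original call is not shown in the notebook, so the arguments here are an assumption):

```python
import pandas as pd

# Made-up rows; only the column names come from the car sales data
toy_x = pd.DataFrame({
    "Manufacturer": ["Acura", "Audi", "Toyota"],
    "Vehicle_type": ["Passenger", "Passenger", "Car"],
    "Horsepower": [140.0, 200.0, 150.0],
})

# One 0/1 column per category; numeric columns are left untouched
dummies = pd.get_dummies(toy_x, columns=["Manufacturer", "Vehicle_type"], dtype=float)
print(dummies)
```

Unlike the `ColumnTransformer` approach, `get_dummies` keeps column names, but it does not remember the categories for later use on new data.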

In [27]:
#refitting the data in the model

x


Unnamed: 0,Manufacturer,Sales_in_thousands,__year_resale_value,Vehicle_type,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Power_perf_factor
0,Acura,16.919,16.360,Passenger,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,58.280150
1,Acura,39.384,19.875,Passenger,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,91.370778
3,Acura,8.588,29.725,Passenger,3.5,210.0,114.6,71.4,196.6,3.850,18.0,22.0,91.389779
5,Audi,18.780,23.555,Passenger,2.8,200.0,108.7,76.1,192.0,3.561,18.5,22.0,84.565105
6,Audi,1.380,39.000,Passenger,4.2,310.0,113.0,74.0,198.2,3.902,23.7,21.0,134.656858
...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,Toyota,68.411,19.425,Car,2.7,150.0,105.3,66.5,183.3,3.440,18.5,23.0,62.355577
144,Toyota,9.835,34.080,Car,4.7,230.0,112.2,76.4,192.5,5.115,25.4,15.0,102.528984
145,Volkswagen,9.761,11.425,Passenger,2.0,115.0,98.9,68.3,163.3,2.767,14.5,26.0,46.943877
146,Volkswagen,83.721,13.240,Passenger,2.0,115.0,98.9,68.3,172.3,2.853,14.5,26.0,47.638237


In [28]:
file = pd.read_csv("../datasets/transformed_cars.csv")
file.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       90 non-null     float64
 1   1       90 non-null     float64
 2   2       90 non-null     float64
 3   3       90 non-null     float64
 4   4       90 non-null     float64
 5   5       90 non-null     float64
 6   6       90 non-null     float64
 7   7       90 non-null     float64
 8   8       90 non-null     float64
 9   9       90 non-null     float64
 10  10      90 non-null     float64
 11  11      90 non-null     float64
 12  12      90 non-null     float64
 13  13      90 non-null     float64
 14  14      90 non-null     float64
 15  15      90 non-null     float64
 16  16      90 non-null     float64
 17  17      90 non-null     float64
 18  18      90 non-null     float64
 19  19      90 non-null     float64
 20  20      90 non-null     float64
 21  21      90 non-null     float64
 22  22  


In [30]:
# refit the model 
np.random.seed(42)
x_train, x_test, y_train, y_test = train_test_split(transfromed_x,y,test_size=0.2)
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.9299074260342777
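
For a regressor such as `RandomForestRegressor`, `score` returns the coefficient of determination (R²), where 1.0 means perfect predictions. Here is a hand-written sketch of that formula on made-up numbers. `r2_manual` is a hypothetical helper, not the scikit-learn function.

```python
def r2_manual(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [20.0, 30.0, 40.0, 50.0]      # made-up prices (in thousands)
predicted = [22.0, 28.0, 41.0, 49.0]   # made-up predictions
print(r2_manual(actual, predicted))
```

A score of about 0.93 therefore means the model explains roughly 93% of the variance in the test prices for that particular split.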

In [31]:
print(x_train)


[[  0.           0.           0.         ...  19.          21.
   93.9579169 ]
 [  0.           0.           0.         ...  21.1         20.
  139.9822936 ]
 [  0.           0.           0.         ...  20.          24.
   60.95118512]
 ...
 [  0.           0.           0.         ...  12.5         29.
   52.08489875]
 [  0.           0.           0.         ...  19.1         22.
  141.14115   ]
 [  0.           0.           0.         ...  16.3         25.
   58.60677292]]


In [32]:
len(x_train)

72

In [33]:
from sklearn import naive_bayes  # imported but never used: `model` below is still the RandomForestRegressor

x_train, x_test, y_train, y_test = train_test_split(transfromed_x,y,test_size=0.2)
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.4653073233745887

In [35]:
from sklearn import svm  # imported but never used: `model` below is still the RandomForestRegressor
x_train, x_test, y_train, y_test = train_test_split(transfromed_x,y,test_size=0.2)
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.9726617844826526

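Note that neither of the last two cells builds a naive Bayes or SVM model: `model` is still the `RandomForestRegressor` created earlier, and the imported modules are never used. The scores 0.465 and 0.973 differ only because each cell draws a different random train/test split and refits the forest, so they do not show that SVM is better than naive Bayes.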
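
To actually compare two algorithms, each one has to be created as its own estimator and scored on the same split. Here is a sketch on synthetic data, under a few assumptions: the data is randomly generated (not the car data), `SVR` stands in for "svm", and naive Bayes is left out because scikit-learn's naive Bayes models are classifiers and do not fit a continuous target such as price.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression data, made up for this comparison
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 1, size=200)

# One fixed split, shared by both models, so the scores are comparable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "random forest": RandomForestRegressor(random_state=42),
    "SVR": SVR(),
}
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)                         # train on the same rows
    scores[name] = estimator.score(X_te, y_te)        # R^2 on the same test rows
    print(name, round(scores[name], 3))
```

Passing `random_state` fixes the split, so a difference in score reflects the model rather than the luck of the draw.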