<a href="https://colab.research.google.com/github/yabbou/python-data-science/blob/main/project_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Dimensionality Reduction & Feature Selection Project**
==

Introduction
==

**Research Question:** _To what extent can the independent variables predict a car's **average city miles per gallon (MPG)**?_

**Variables:** The original data set includes 26 variables: 10 categorical and 16 numerical. Of these, 25 serve as independent variables, and the `city-mpg` variable serves as the dependent variable for our regression model.

**Summary:**


Step 1: Introduction

Step 2: Exploratory Data Analysis

Step 3: Data Cleaning

Step 4: Feature Engineering and Feature Selection

Step 5: Multiple Linear Regression

Step 6: Conclusion

Part 2: EDA
==

The variables are detailed here: https://archive.ics.uci.edu/ml/datasets

1. **symboling**: the degree to which the car is riskier than its price indicates. It ranges from -3 (safer) to 3 (more risky).
2. **normalized-losses**: the relative average loss payment per insured vehicle year, as compared to other cars. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.) and represents the average loss per car per year. In this dataset, it ranges from 65 to 256.

3. **make**: the manufacturer of the car. In this dataset: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo.

4. **fuel-type**: diesel, gas.

5. **aspiration**: std, turbo.
6. **num-of-doors**: four, two.
7. **body-style**: hardtop, wagon, sedan, hatchback, convertible.
8. **drive-wheels**: 4wd, fwd, rwd.
9. **engine-location**: front, rear.
10. **wheel-base**: the distance between the centers of the front and rear wheels (continuous from 86.6 to 120.9).

11. **length**: continuous from 141.1 to 208.1.
12. **width**: continuous from 60.3 to 72.3.
13. **height**: continuous from 47.8 to 59.8.
14. **curb-weight**: the weight of an automobile without occupants or baggage (continuous from 1488 to 4066).
15. **engine-type**: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. **num-of-cylinders**: eight, five, four, six, three, twelve, two.
17. **engine-size**: continuous from 61 to 326.
18. **fuel-system**: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. **bore**: the diameter of each cylinder (continuous from 2.54 to 3.94).
20. **stroke**: the distance travelled by the piston in each cycle, continuous from 2.07 to 4.17.
21. **compression-ratio**: the ratio of the maximum to minimum volume in the cylinder of an internal combustion engine (continuous from 7 to 23).
22. **horsepower**: the power output of the car's engine (continuous from 48 to 288).
23. **peak-rpm**: the maximum revolutions per minute (continuous from 4150 to 6600).
24. **city-mpg**: continuous from 13 to 49. **THE TARGET VARIABLE**
25. **highway-mpg**: continuous from 16 to 54.
26. **price**: continuous from 5118 to 45400.

Here is a quick look at the data:

In [97]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

cars_df = pd.read_csv("https://raw.githubusercontent.com/MatthewFried/Udemy/master/Day3/Day3%20Data.csv")
cars_df.head()

Unnamed: 0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
1,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
2,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
3,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
4,2,?,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250


The raw file has no header row, so pandas has promoted the first record to column names: some columns are named after numbers, one is named `?`, and others carry category values such as `gas` or `std`.

Let's first clarify the column names, based on the website data.

In [98]:
COLUMNS=['Symboling', 'Normalized_Losses', 'Make', 'Fuel_Type','Aspiration','Num_Doors','Body_Style', 'Drive_Wheels','Engine_Location','Wheel_Base', 'Length','Width','Height','Curb_Weight','Engine_Type','Num_Cylinders','Engine_Size','Fuel_System','Bore','Stroke','Compression_Ratio', 'Horsepower','Peak_RPM','City_MPG','Highway_MPG','Price']
cars_df = pd.read_csv("https://raw.githubusercontent.com/MatthewFried/Udemy/master/Day3/Day3%20Data.csv", names=COLUMNS)
cars_df.columns

Index(['Symboling', 'Normalized_Losses', 'Make', 'Fuel_Type', 'Aspiration',
       'Num_Doors', 'Body_Style', 'Drive_Wheels', 'Engine_Location',
       'Wheel_Base', 'Length', 'Width', 'Height', 'Curb_Weight', 'Engine_Type',
       'Num_Cylinders', 'Engine_Size', 'Fuel_System', 'Bore', 'Stroke',
       'Compression_Ratio', 'Horsepower', 'Peak_RPM', 'City_MPG',
       'Highway_MPG', 'Price'],
      dtype='object')

Let's see some statistics for the numeric variables _without unknown values_:

In [99]:
cars_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Symboling,205.0,0.834146,1.245307,-2.0,0.0,1.0,2.0,3.0
Wheel_Base,205.0,98.756585,6.021776,86.6,94.5,97.0,102.4,120.9
Length,205.0,174.049268,12.337289,141.1,166.3,173.2,183.1,208.1
Width,205.0,65.907805,2.145204,60.3,64.1,65.5,66.9,72.3
Height,205.0,53.724878,2.443522,47.8,52.0,54.1,55.5,59.8
Curb_Weight,205.0,2555.565854,520.680204,1488.0,2145.0,2414.0,2935.0,4066.0
Engine_Size,205.0,126.907317,41.642693,61.0,97.0,120.0,141.0,326.0
Compression_Ratio,205.0,10.142537,3.97204,7.0,8.6,9.0,9.4,23.0
City_MPG,205.0,25.219512,6.542142,13.0,19.0,24.0,30.0,49.0
Highway_MPG,205.0,30.75122,6.886443,16.0,25.0,30.0,34.0,54.0


What are the unique values in our data? Any out of the ordinary?

In [100]:
def displayUnique():
  # Print each column name followed by its sorted unique values (np.unique sorts its result).
  for col in cars_df.columns:
    print(col)
    print(np.unique(cars_df[col]), '\n')
displayUnique()

Symboling
[-2 -1  0  1  2  3] 

Normalized_Losses
['101' '102' '103' '104' '106' '107' '108' '110' '113' '115' '118' '119'
 '121' '122' '125' '128' '129' '134' '137' '142' '145' '148' '150' '153'
 '154' '158' '161' '164' '168' '186' '188' '192' '194' '197' '231' '256'
 '65' '74' '77' '78' '81' '83' '85' '87' '89' '90' '91' '93' '94' '95'
 '98' '?'] 

Make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
 'volvo'] 

Fuel_Type
['diesel' 'gas'] 

Aspiration
['std' 'turbo'] 

Num_Doors
['?' 'four' 'two'] 

Body_Style
['convertible' 'hardtop' 'hatchback' 'sedan' 'wagon'] 

Drive_Wheels
['4wd' 'fwd' 'rwd'] 

Engine_Location
['front' 'rear'] 

Wheel_Base
[ 86.6  88.4  88.6  89.5  91.3  93.   93.1  93.3  93.7  94.3  94.5  95.1
  95.3  95.7  95.9  96.   96.1  96.3  96.5  96.6  96.9  97.   97.2  97.3
  98.4  98.8  99.1  99.2  99.4  99.5  

Some numeric columns store their numbers as strings. None of the values is formally null; unknown entries are instead encoded as `?`. Once we replace those placeholders with proper missing values, we can convert these columns to a numeric type.
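
As an aside, one possible way to handle the placeholder and the type conversion in a single step is `pd.to_numeric` with `errors="coerce"`. The sketch below runs on a hypothetical toy frame (`toy_df` and its values are assumptions for illustration, not the project data):

```python
import pandas as pd

# Hypothetical toy frame mimicking the '?' placeholder seen in the raw data.
toy_df = pd.DataFrame({
    "Horsepower": ["111", "154", "?"],
    "Make": ["audi", "bmw", "audi"],
})

# errors="coerce" turns anything that cannot be parsed (here '?') into NaN.
toy_df["Horsepower"] = pd.to_numeric(toy_df["Horsepower"], errors="coerce")

print(toy_df.dtypes)
print(toy_df["Horsepower"].isna().sum())
```

Coercion silently hides any malformed value, so it is worth checking the unique values first, as done above.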

Part 3: Data Cleaning
==

Nulls?

In [101]:
# Treat the '?' placeholder as a proper missing value.
cars_df = cars_df.replace('?', np.nan)
null_count = cars_df.isnull().sum()

print('Columns with null values:', [col for col in cars_df if cars_df[col].isnull().any()], '\n')
# Keep only the columns that actually contain nulls.
null_count = null_count[null_count > 0]

print(null_count, '\n')
# Share of missing values per column, in percent.
print(null_count / len(cars_df) * 100)

Columns with null values: ['Normalized_Losses', 'Num_Doors', 'Bore', 'Stroke', 'Horsepower', 'Peak_RPM', 'Price'] 

Normalized_Losses    41
Num_Doors             2
Bore                  4
Stroke                4
Horsepower            2
Peak_RPM              2
Price                 4
dtype: int64 

Normalized_Losses    20.00000
Num_Doors             0.97561
Bore                  1.95122
Stroke                1.95122
Horsepower            0.97561
Peak_RPM              0.97561
Price                 1.95122
dtype: float64


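Thankfully, the unknown values appear to co-occur in the same rows: `Bore` and `Stroke` are each missing 4 values, and `Horsepower` and `Peak_RPM` 2 each. `Normalized_Losses` is the exception, with 41 missing values.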
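
Matching counts only suggest that the nulls fall in the same rows. A minimal sketch of how this could be checked directly, using a hypothetical toy frame (`toy_df` is an assumption; in the notebook the same calls would run on `cars_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy_df = pd.DataFrame({
    "Bore":   [3.47, np.nan, 3.19, np.nan],
    "Stroke": [2.68, np.nan, 3.40, np.nan],
    "Price":  [13495.0, 16500.0, np.nan, 17450.0],
})

# Boolean mask of missing cells, then count rows where both columns are missing.
missing = toy_df.isna()
both_missing = (missing["Bore"] & missing["Stroke"]).sum()

# Correlation between missingness indicators: 1.0 means the nulls always co-occur.
missing_corr = missing.astype(int).corr()

print(both_missing)
print(missing_corr.loc["Bore", "Stroke"])
```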

Before null column values are replaced, let's separate the numeric columns from the categorical.

In [105]:
def calcAndDisplayColumnTypes():  
  numeric_columns = []
  categorical_columns = []

  for i in cars_df.columns[:]:
    if(cars_df[i].dtype=='object'):
      categorical_columns.append(i)
    else:
      numeric_columns.append(i)
       
  print(len(numeric_columns),'Numeric variables:',numeric_columns)
  print(len(categorical_columns),'Categorical variables:',categorical_columns)

calcAndDisplayColumnTypes()

10 Numeric variables: ['Symboling', 'Wheel_Base', 'Length', 'Width', 'Height', 'Curb_Weight', 'Engine_Size', 'Compression_Ratio', 'City_MPG', 'Highway_MPG']
16 Categorical variables: ['Normalized_Losses', 'Make', 'Fuel_Type', 'Aspiration', 'Num_Doors', 'Body_Style', 'Drive_Wheels', 'Engine_Location', 'Engine_Type', 'Num_Cylinders', 'Fuel_System', 'Bore', 'Stroke', 'Horsepower', 'Peak_RPM', 'Price']


Some numeric columns are currently type `object`. We will change them to `float`.

In [106]:
SHOULD_BE_NUMERIC_COLS = ['Bore', 'Stroke','Normalized_Losses', 'Horsepower','Price','Peak_RPM']
cars_df[SHOULD_BE_NUMERIC_COLS] = cars_df[SHOULD_BE_NUMERIC_COLS].astype('float')

calcAndDisplayColumnTypes()

16 Numeric variables: ['Symboling', 'Normalized_Losses', 'Wheel_Base', 'Length', 'Width', 'Height', 'Curb_Weight', 'Engine_Size', 'Bore', 'Stroke', 'Compression_Ratio', 'Horsepower', 'Peak_RPM', 'City_MPG', 'Highway_MPG', 'Price']
10 Categorical variables: ['Make', 'Fuel_Type', 'Aspiration', 'Num_Doors', 'Body_Style', 'Drive_Wheels', 'Engine_Location', 'Engine_Type', 'Num_Cylinders', 'Fuel_System']


The columns are now type `float`, but they still contain missing values, so imputation appears necessary. Let's try that using the mean of each column.

In [107]:
from sklearn.impute import SimpleImputer
# cars_df['Num_Doors'] = cars_df['Num_Doors'].replace('two',2).replace('four',4) 
numeric_columns_hardcoded = ['Symboling', 'Normalized_Losses', 'Wheel_Base', 'Length', 'Width', 'Height', 'Curb_Weight', 'Engine_Size', 'Bore', 'Stroke', 'Compression_Ratio', 'Horsepower', 'Peak_RPM', 'City_MPG', 'Highway_MPG', 'Price']

values = cars_df[numeric_columns_hardcoded].values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# transformed_values 

cars_without_nulls_df = pd.DataFrame(transformed_values)
cars_without_nulls_df.columns = numeric_columns_hardcoded
cars_without_nulls_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Symboling,205.0,0.834146,1.245307,-2.0,0.0,1.0,2.0,3.0
Normalized_Losses,205.0,122.0,31.681008,65.0,101.0,122.0,137.0,256.0
Wheel_Base,205.0,98.756585,6.021776,86.6,94.5,97.0,102.4,120.9
Length,205.0,174.049268,12.337289,141.1,166.3,173.2,183.1,208.1
Width,205.0,65.907805,2.145204,60.3,64.1,65.5,66.9,72.3
Height,205.0,53.724878,2.443522,47.8,52.0,54.1,55.5,59.8
Curb_Weight,205.0,2555.565854,520.680204,1488.0,2145.0,2414.0,2935.0,4066.0
Engine_Size,205.0,126.907317,41.642693,61.0,97.0,120.0,141.0,326.0
Bore,205.0,3.329751,0.270844,2.54,3.15,3.31,3.58,3.94
Stroke,205.0,3.255423,0.313597,2.07,3.11,3.29,3.41,4.17


In [108]:
displayUnique()

Symboling
[-2 -1  0  1  2  3] 

Normalized_Losses
[ 65.  74.  77.  78.  81.  83.  85.  87.  89.  90.  91.  93.  94.  95.
  98. 101. 102. 103. 104. 106. 107. 108. 110. 113. 115. 118. 119. 121.
 122. 125. 128. 129. 134. 137. 142. 145. 148. 150. 153. 154. 158. 161.
 164. 168. 186. 188. 192. 194. 197. 231. 256.  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan] 

Make
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
 'volvo'] 

Fuel_Type
['diesel' 'gas'] 

Aspiration
['std' 'turbo'] 

Num_Doors


TypeError: ignored
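
The `TypeError` above most likely arises because `np.unique` sorts its input, and a column such as `Num_Doors` now mixes strings with `NaN` (a float), which cannot be compared. Note also that the imputed values live in `cars_without_nulls_df`, so `cars_df` still contains the nulls. A minimal sketch of a null-tolerant variant follows; `display_unique` and `toy_df` are hypothetical names, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def display_unique(df):
    """Print the sorted non-null unique values of each column (hypothetical helper)."""
    for col in df.columns:
        # Drop NaN first so strings are never compared with floats while sorting.
        values = np.unique(df[col].dropna())
        print(col)
        print(values, "\n")

# Hypothetical toy frame with a missing value in a string column.
toy_df = pd.DataFrame({"Num_Doors": ["four", np.nan, "two", "four"]})
display_unique(toy_df)
```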

Let us visualise the column data distributions:

In [None]:
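# Melt the numeric columns into long form so each variable gets its own facet.
nd = pd.melt(cars_df[numeric_columns_hardcoded])
numeric = sns.FacetGrid(nd, col='variable', col_wrap=5, sharex=False, sharey=False)
# histplot with kde=True stands in for distplot, which is deprecated since seaborn 0.11.
numeric.map(sns.histplot, 'value', kde=True)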

Most attributes _generally_ fit a normal distribution, besides `Compression_Ratio`, whose values cluster tightly between about 8.6 and 9.4 with a long tail up to 23 (as seen in `.describe()` above). Many are slightly skewed.
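
Skew can also be quantified rather than judged by eye. The sketch below uses pandas' `DataFrame.skew()` on a hypothetical toy frame (`toy_df` and its column names are assumptions for illustration); values near 0 indicate a symmetric distribution, and positive values a longer right tail:

```python
import pandas as pd

# Hypothetical toy data: one symmetric column and one with a long right tail.
toy_df = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 10.0],
})

# Sample skewness per column.
skewness = toy_df.skew()
print(skewness)
```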

Here are the categorical column distributions:

In [None]:
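# Columns still typed as object are the categorical ones.
categorical_columns = cars_df.select_dtypes(include='object').columns
fig, ax = plt.subplots(4, 4, figsize=(30, 30))
for variable, subplot in zip(categorical_columns, ax.flatten()):
    # Count plot, because categorical data is not continuous.
    sns.countplot(x=cars_df[variable], ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)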

Let's visualize some of the column correlations.

In [None]:
categorical_columns = []
FIRST_NON_NULL_ROW = 3
for col in cars_df:
  column_type = type(cars_df.loc[FIRST_NON_NULL_ROW, col])
  if column_type == str:
    categorical_columns.append(col)
print(categorical_columns)

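fig = plt.figure(figsize=(16, 12))
# Correlate numeric columns only; object columns cannot be correlated (numeric_only needs pandas >= 1.5).
corr = cars_df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)

# Workaround for older matplotlib releases (e.g. 3.1.1) that cut off the top and bottom heatmap rows.
b, t = plt.ylim()
b += 0.5
t -= 0.5
plt.ylim(b, t)
fig.suptitle('Numeric Correlation Heatmap')
plt.show()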
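
A heatmap of this size can be hard to read for a single target. As a complementary sketch, the correlations with `City_MPG` could be ranked by absolute value; `toy_df` below is a hypothetical frame with illustrative values, and the same chain would apply to `cars_df`:

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy_df = pd.DataFrame({
    "City_MPG":    [21, 19, 24, 18, 30],
    "Curb_Weight": [2548, 2823, 2337, 2824, 1900],
    "Height":      [48.8, 52.4, 54.3, 54.3, 50.0],
    "Make":        ["alfa-romero", "alfa-romero", "audi", "audi", "honda"],
})

# Correlate numeric columns with the target, drop the target itself,
# and sort by absolute strength so strong negative correlations rank high too.
corr_with_target = (
    toy_df.corr(numeric_only=True)["City_MPG"]
    .drop("City_MPG")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative predictors, such as weight in this toy data, at the top of the list.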

Part 4: Feature Selection and Dimension Reduction
==

Validation
==