# DSO106 ML L3 Hands On

## Import Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.cluster import KMeans

## Load in Data

In [2]:
Mpg = sns.load_dataset('mpg')
Mpg.head()

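|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |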


## Data Wrangling

### Drop the string columns

In [3]:
MpgTrimmed = Mpg.drop(['origin', 'name'], axis=1)
MpgTrimmed.head()

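|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|-----|-----------|--------------|------------|--------|--------------|------------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 |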
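As an alternative to naming the string columns by hand, the numeric columns can be selected by dtype. The sketch below uses a small made-up frame (not the real dataset) and the hypothetical names `cars` and `numeric_only`.

```python
import pandas as pd

# Made-up rows shaped like the mpg data; these are NOT the real values.
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 31.0],
    "cylinders": [8, 8, 4],
    "origin": ["usa", "usa", "japan"],
    "name": ["car a", "car b", "car c"],
})

# Keep every numeric column, which drops the string columns implicitly.
numeric_only = cars.select_dtypes(include="number")
print(list(numeric_only.columns))
```

Dropping columns by name, as the cell above does, is more explicit; selecting by dtype avoids editing the list if more string columns appear.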


### Drop Missing Values

In [4]:
MpgTrimmed.dropna(inplace=True)
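Before dropping rows, it can help to see which columns actually contain missing values and how many rows will be lost. This is a minimal sketch on a made-up frame; the names `cars`, `missing_per_column`, and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing horsepower values.
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0],
    "horsepower": [130.0, np.nan, 95.0, np.nan],
})

# Count missing values per column before removing anything.
missing_per_column = cars.isna().sum()
print(missing_per_column)

# dropna() without inplace returns a new frame and keeps the original index labels.
cleaned = cars.dropna()
print(len(cars), "->", len(cleaned))
```

Because the original index labels are kept, a frame can report fewer entries than its highest index label suggests, which is what the `info()` output below shows (392 entries, labels 0 to 397).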

### Convert Floats to Integers

In [5]:
MpgTrimmed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   model_year    392 non-null    int64  
dtypes: float64(4), int64(3)
memory usage: 24.5 KB


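*Looks like `mpg`, `displacement`, `horsepower`, and `acceleration` are all floats, so the next cells convert them to integers for this exercise. Note that `astype(int)` truncates the decimal part rather than rounding.*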

In [6]:
MpgTrimmed.mpg = MpgTrimmed.mpg.astype(int)

In [7]:
MpgTrimmed.displacement = MpgTrimmed.displacement.astype(int)
MpgTrimmed.horsepower = MpgTrimmed.horsepower.astype(int)
MpgTrimmed.acceleration = MpgTrimmed.acceleration.astype(int)
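Since `astype(int)` truncates, a value such as 11.5 becomes 11, which is visible in the later `head()` output. The sketch below, on made-up values, contrasts truncation with rounding first, and shows one `astype` call with a dict as a possible alternative to converting column by column.

```python
import pandas as pd

# Made-up acceleration values, not taken from the dataset.
values = pd.DataFrame({"acceleration": [11.7, 10.2, 12.0]})

# Plain conversion drops the decimal part.
truncated = values["acceleration"].astype(int)
print(truncated.tolist())  # [11, 10, 12]

# Rounding first keeps values closer to the originals.
rounded = values["acceleration"].round().astype(int)
print(rounded.tolist())  # [12, 10, 12]

# Several columns can be converted at once by passing a column-to-dtype mapping.
converted = values.astype({"acceleration": int})
print(converted.dtypes)
```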

## Perform k-Means Clustering

### Testing 2 Clusters

In [8]:
kmeans = KMeans(n_clusters=2)
kmeans.fit(MpgTrimmed)

KMeans(n_clusters=2)
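k-means groups rows by Euclidean distance, so a column measured in thousands (such as `weight`) can outweigh a column measured in single digits (such as `cylinders`). The notebook clusters the raw values; the sketch below shows one optional variation, standardizing the features first, on made-up data. `StandardScaler` is a real scikit-learn class, while `toy`, `scaled`, and `model` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data on two very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "mpg": rng.normal(23, 7, 60),
    "weight": rng.normal(3000, 800, 60),
})

# Rescale each column to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(toy)

# Cluster on the scaled values so both columns contribute comparably.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(np.bincount(model.labels_))
```

Whether scaling is appropriate depends on the question being asked; with raw values, the groups here are likely driven mostly by weight.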

In [9]:
MpgTrimmed['Group'] = kmeans.labels_

In [10]:
MpgTrimmed.head()

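|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | Group |
|---|-----|-----------|--------------|------------|--------|--------------|------------|-------|
| 0 | 18 | 8 | 307 | 130 | 3504 | 12 | 70 | 1 |
| 1 | 15 | 8 | 350 | 165 | 3693 | 11 | 70 | 1 |
| 2 | 18 | 8 | 318 | 150 | 3436 | 11 | 70 | 1 |
| 3 | 16 | 8 | 304 | 150 | 3433 | 12 | 70 | 1 |
| 4 | 17 | 8 | 302 | 140 | 3449 | 10 | 70 | 1 |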


In [11]:
MpgTrimmed.groupby('Group').mean()

| Group | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|-------|-----|-----------|--------------|------------|--------|--------------|------------|
| 0 | 27.889831 | 4.305085 | 123.521186 | 82.59322 | 2381.381356 | 15.813559 | 76.783898 |
| 1 | 16.314103 | 7.237179 | 301.653846 | 137.564103 | 3879.532051 | 14.237179 | 74.762821 |


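*If you use two clusters, it looks like Group 1 contains cars that get lower mpg, have more cylinders on average, have greater displacement, more horsepower, are heavier, and are slightly older. Their `acceleration` values are also lower; if that column records the seconds needed to reach a set speed, as it is usually documented for this dataset, lower means quicker rather than slower. In summation: older trucks perhaps.*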
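The group numbers are arbitrary: rerunning `KMeans` without a fixed seed can swap which cluster is labeled 0 and which is labeled 1. A minimal sketch, on made-up data, of fixing the seed with `random_state` and checking how many rows land in each group (`toy` and `sizes` are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two made-up, well-separated groups of (mpg, weight) rows.
rng = np.random.default_rng(1)
light = rng.normal(loc=[30, 2200], scale=[3, 150], size=(40, 2))
heavy = rng.normal(loc=[15, 4000], scale=[2, 200], size=(25, 2))
toy = pd.DataFrame(np.vstack([light, heavy]), columns=["mpg", "weight"])

# random_state makes the label assignment repeatable between runs.
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy)
toy["Group"] = model.labels_

# Group sizes give context for the per-group means.
sizes = toy["Group"].value_counts()
print(sizes)
```

Reading the means together with the sizes helps avoid over-interpreting a cluster that holds only a handful of rows.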

### Testing 3 Clusters

In [12]:
kmeans = KMeans(n_clusters=3)
# Leave out the earlier 'Group' labels so they are not treated as a feature
kmeans.fit(MpgTrimmed.drop(columns='Group'))

KMeans(n_clusters=3)

In [13]:
MpgTrimmed['Group'] = kmeans.labels_

In [14]:
MpgTrimmed.groupby('Group').mean()

| Group | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|-------|-----|-----------|--------------|------------|--------|--------------|------------|
| 0 | 14.533333 | 7.866667 | 344.144444 | 157.811111 | 4236.322222 | 13.2 | 74.011111 |
| 1 | 29.483333 | 4.038889 | 107.205556 | 77.166667 | 2222.827778 | 15.955556 | 76.711111 |
| 2 | 20.590164 | 5.819672 | 212.614754 | 105.401639 | 3162.581967 | 15.516393 | 76.352459 |


*Ok, now with three clusters, it looks like Group 1 gets the best mpg, has the fewest cylinders, is lowest on horsepower, and is the newest and lightest. Probably little sedans.*

*Then Group 0 is the original big, heavy, low-mpg group.*

*Group 2 sits between the other two on every measure, so it seems to be midrange cars.*
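The fit, label, and summarize steps repeat for each cluster count, and the label column from one run should not be fed into the next fit as a feature. One way to package this is a small helper; `cluster_means` is a hypothetical function written for this sketch, and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_means(frame, k, label_col="Group", seed=0):
    """Fit k-means on the feature columns and return per-group means."""
    # Ignore any label column left over from an earlier run.
    features = frame.drop(columns=[label_col], errors="ignore")
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    # Attach labels to a copy so the caller's frame is not modified.
    labeled = features.assign(**{label_col: model.labels_})
    return labeled.groupby(label_col).mean()


# Three made-up groups of (mpg, weight) rows.
rng = np.random.default_rng(0)
centers = [(30, 2100), (21, 3100), (14, 4300)]
blobs = [rng.normal(loc=c, scale=(2, 120), size=(20, 2)) for c in centers]
toy = pd.DataFrame(np.vstack(blobs), columns=["mpg", "weight"])

summary = cluster_means(toy, 3)
print(summary)
```

With a helper like this, a call such as `cluster_means(MpgTrimmed, 4)` would give the same kind of table as the cells in this notebook, assuming `MpgTrimmed` holds only numeric columns plus an optional `Group` column.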

### Testing 4 Clusters

In [15]:
kmeans = KMeans(n_clusters=4)
# Leave out the earlier 'Group' labels so they are not treated as a feature
kmeans.fit(MpgTrimmed.drop(columns='Group'))

KMeans(n_clusters=4)

In [16]:
MpgTrimmed['Group'] = kmeans.labels_

In [17]:
MpgTrimmed.groupby('Group').mean()

| Group | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|-------|-----|-----------|--------------|------------|--------|--------------|------------|
| 0 | 13.898551 | 8.0 | 356.536232 | 165.130435 | 4366.594203 | 12.782609 | 73.608696 |
| 1 | 24.418367 | 4.704082 | 154.346939 | 94.295918 | 2746.438776 | 15.326531 | 77.112245 |
| 2 | 18.179775 | 6.640449 | 259.966292 | 116.808989 | 3484.483146 | 15.337079 | 75.58427 |
| 3 | 30.566176 | 3.977941 | 98.125 | 72.948529 | 2107.705882 | 16.205882 | 76.625 |


*Adding a fourth group means that Group 0 becomes the oldest, heaviest, lowest-mpg group yet!*

*Can probably stop now, since the fourth cluster doesn't seem to add much differentiation between groups. It is subjective, but it looks like three clusters may be optimal.*
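A common way to support this kind of judgment with numbers is the elbow heuristic: fit k-means for several values of k and compare `inertia_`, the within-cluster sum of squared distances. This is a sketch on made-up data with three obvious groups, not a result for the mpg data; `toy` and `inertias` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three made-up, well-separated groups of (mpg, weight) rows.
rng = np.random.default_rng(0)
centers = [(30, 2100), (21, 3100), (14, 4300)]
blobs = [rng.normal(loc=c, scale=(2, 120), size=(30, 2)) for c in centers]
toy = pd.DataFrame(np.vstack(blobs), columns=["mpg", "weight"])

# Record the inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value))
```

The value of k after which the inertia stops dropping sharply is a candidate for the number of clusters. On real data the bend is often less clear, so the choice remains partly subjective.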