# Study Unit 5: Data Analytics in Python

## scikit-learn

* Simple and efficient tools for **predictive data analysis**
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

## Algorithms in scikit-learn

1. Supervised learning: learn from labeled data
2. Unsupervised learning: learn from unlabeled data
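
The difference between the two families can be sketched with toy data. This is only an illustration: the arrays below are made up, and `LinearRegression` and `KMeans` are just two convenient representatives of each family.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Supervised: every row of X comes with a known target in y (here y = 2x + 1).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[4.0]]))  # estimate the target for an unseen x

# Unsupervised: only features are given; KMeans groups them without labels.
points = np.array([[0.0, 0.0], [0.1, 0.2], [9.0, 9.0], [9.2, 8.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_  # cluster ids are arbitrary; only the grouping matters

print(pred)
print(labels)
```

Note that `fit` takes both `X` and `y` in the supervised case but only the features in the unsupervised case.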

## Install and import scikit-learn

Importing the whole `sklearn` package takes longer, so we usually import only the modules we need.

In [None]:
!pip install scikit-learn

In [None]:
# import sklearn

from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

## Import additional modules

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
from sklearn import tree
from sklearn.cluster import KMeans
from sklearn import decomposition 

# Activity: Car

In [25]:
import pandas as pd

In [26]:
# Page 23: Activity
car = pd.read_csv('car_model.csv', na_values='na_string', na_filter=True)
display(car)

Unnamed: 0,Year,Make,Model,Category
0,2021,Acura,ILX,Sedan
1,2021,Acura,RDX,SUV
2,2021,Acura,TLX,Sedan
3,2021,Alfa Romeo,Giulia,Sedan
4,2021,Alfa Romeo,Stelvio,SUV
...,...,...,...,...
235,2021,Volvo,V60,Wagon
236,2021,Volkswagen,Tiguan,SUV
237,2021,Volvo,XC40,SUV
238,2021,Volvo,XC60,SUV


In [27]:
new_car_df = car.drop(columns=['Model'])

# rename Make to Model (optional; left commented out)
# new_car_df = new_car_df.rename(columns={'Make': 'Model'})

new_car_df.head()

Unnamed: 0,Year,Make,Category
0,2021,Acura,Sedan
1,2021,Acura,SUV
2,2021,Acura,Sedan
3,2021,Alfa Romeo,Sedan
4,2021,Alfa Romeo,SUV


In [28]:
car  # original

Unnamed: 0,Year,Make,Model,Category
0,2021,Acura,ILX,Sedan
1,2021,Acura,RDX,SUV
2,2021,Acura,TLX,Sedan
3,2021,Alfa Romeo,Giulia,Sedan
4,2021,Alfa Romeo,Stelvio,SUV
...,...,...,...,...
235,2021,Volvo,V60,Wagon
236,2021,Volkswagen,Tiguan,SUV
237,2021,Volvo,XC40,SUV
238,2021,Volvo,XC60,SUV
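
The output above shows that `car` still has its `Model` column: `drop` returns a new DataFrame instead of changing the original. A minimal sketch of that behaviour, using a hypothetical three-row stand-in (`car_demo`) rather than the real `car_model.csv`:

```python
import pandas as pd

# Hypothetical stand-in for car_model.csv (not the real data).
car_demo = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "Make": ["Acura", "Acura", "Volvo"],
    "Model": ["ILX", "RDX", "XC40"],
    "Category": ["Sedan", "SUV", "SUV"],
})

# drop returns a new DataFrame; car_demo itself is left untouched.
dropped = car_demo.drop(columns=["Model"])

# rename behaves the same way, so the result must be assigned to a name.
renamed = dropped.rename(columns={"Make": "Model"})

print(list(car_demo.columns))
print(list(dropped.columns))
print(list(renamed.columns))
```

Because neither call modifies its input, a line such as `dropped.rename(...)` on its own has no lasting effect unless the result is stored.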


### Extract the dependent and independent variables

In [29]:
# Method 1
X = new_car_df.drop('Category', axis=1)
y = new_car_df['Category']

In [30]:
X.head()

Unnamed: 0,Year,Make
0,2021,Acura
1,2021,Acura
2,2021,Acura
3,2021,Alfa Romeo
4,2021,Alfa Romeo


In [31]:
y.head()

0    Sedan
1      SUV
2    Sedan
3    Sedan
4      SUV
Name: Category, dtype: object

In [32]:
# Method 2
X_2 = new_car_df[['Year', 'Make']]
y_2 = new_car_df['Category']

In [33]:
X_2.head()

Unnamed: 0,Year,Make
0,2021,Acura
1,2021,Acura
2,2021,Acura
3,2021,Alfa Romeo
4,2021,Alfa Romeo


In [34]:
y_2.head()

0    Sedan
1      SUV
2    Sedan
3    Sedan
4      SUV
Name: Category, dtype: object
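
Method 1 and Method 2 give the same result here. A small check on a hypothetical stand-in frame (`demo`, not the real data) makes that explicit:

```python
import pandas as pd

# Hypothetical stand-in for new_car_df.
demo = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "Make": ["Acura", "Alfa Romeo", "Volvo"],
    "Category": ["Sedan", "SUV", "Wagon"],
})

X_by_drop = demo.drop("Category", axis=1)   # Method 1: remove the target column
X_by_pick = demo[["Year", "Make"]]          # Method 2: name the feature columns

same = X_by_drop.equals(X_by_pick)
print(same)
```

The two differ only when the frame changes: Method 1 keeps every column that is not the target, including any added later, while Method 2 pins the feature set to the columns listed.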

## Collapse entries with more than one category to the first category

In [35]:
c_unique = y.unique()
c_unique

array(['Sedan', 'SUV', 'Wagon', 'Convertible,Sedan,Coupe', 'Hatchback',
       'Coupe', 'Pickup', 'Convertible,Coupe', 'Van/Minivan',
       'Coupe,Convertible', 'Hatchback,Sedan', 'Convertible',
       'Wagon,Sedan'], dtype=object)

In [36]:
d = {}    # collector dictionary

for c in y.unique():
    if ',' in c:
        # d maps the full string c to the first item before ','
        # example: 'Convertible,Sedan,Coupe' -> 'Convertible'
        d[c] = c.split(',')[0]

print("===")
print(d)

===
{'Convertible,Sedan,Coupe': 'Convertible', 'Convertible,Coupe': 'Convertible', 'Coupe,Convertible': 'Coupe', 'Hatchback,Sedan': 'Hatchback', 'Wagon,Sedan': 'Wagon'}


In [37]:
y.replace(to_replace=d).unique()

array(['Sedan', 'SUV', 'Wagon', 'Convertible', 'Hatchback', 'Coupe',
       'Pickup', 'Van/Minivan'], dtype=object)

In [38]:
y = y.replace(to_replace=d)

In [39]:
y.unique()

array(['Sedan', 'SUV', 'Wagon', 'Convertible', 'Hatchback', 'Coupe',
       'Pickup', 'Van/Minivan'], dtype=object)
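
The dictionary-and-`replace` approach above can also be written as a single string operation. The sketch below compares the two on a hypothetical sample Series (`y_demo`); it is an alternative, not what the notebook itself uses.

```python
import pandas as pd

# Hypothetical sample of category strings.
y_demo = pd.Series(
    ["Sedan", "Convertible,Sedan,Coupe", "Coupe,Convertible", "Van/Minivan"],
    name="Category",
)

# Approach used above: build a mapping for multi-category values, then replace.
mapping = {c: c.split(",")[0] for c in y_demo.unique() if "," in c}
via_replace = y_demo.replace(to_replace=mapping)

# Alternative: split every value on ',' and keep the first piece.
via_split = y_demo.str.split(",").str[0]

print(via_replace.tolist())
print(via_split.tolist())
```

Values without a comma pass through both approaches unchanged, since splitting them yields a one-element list.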

## Train Test Split

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3,
                                                    random_state=42)


In [41]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


(168, 2) (168,)
(72, 2) (72,)
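
The shapes above add up to 240 rows, of which 30% (72) go to the test set. A minimal sketch on synthetic arrays of the same length reproduces those sizes and shows what `random_state` is for. The arrays are made up, and the rounding rule in the comment reflects my understanding of scikit-learn's behaviour rather than anything stated in this notebook.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 240  # total implied by the printed shapes (168 + 72)
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)  # synthetic features
y_demo = np.arange(n_rows)                         # synthetic targets

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Assumption: a fractional test_size is rounded up to a whole number of rows.
expected_test = math.ceil(0.3 * n_rows)

# The same random_state yields the same split on a second call.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape, expected_test)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the split reproducible between runs, which makes later model scores comparable.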
