# **Fundamentals of data collection, cleaning and preprocessing**

In [1]:
import numpy as np
import pandas as pd

## **Loading data**

In [2]:
cols = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
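# The file is whitespace-delimited and has no header row, so column names are passed explicitly.
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas releases).
mpg_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original",
                     sep=r"\s+",
                     names=cols)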

In [3]:
mpg_df.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 | chevrolet chevelle malibu |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 | buick skylark 320 |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 | plymouth satellite |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 | amc rebel sst |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 | ford torino |


In [4]:
mpg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     406 non-null    float64
 2   displacement  406 non-null    float64
 3   horsepower    400 non-null    float64
 4   weight        406 non-null    float64
 5   acceleration  406 non-null    float64
 6   year          406 non-null    float64
 7   origin        406 non-null    float64
 8   name          406 non-null    object 
dtypes: float64(8), object(1)
memory usage: 28.7+ KB


### **Determine column type**
According to the dataset description, the columns have the following types:

1. mpg: **continuous**
2. cylinders: multi-valued discrete
3. displacement: **continuous**
4. horsepower: **continuous**
5. weight: **continuous**
6. acceleration: **continuous**
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
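
Since `info()` reports the discrete columns as `float64`, one option is to cast them to types that match the description. The following is a minimal sketch on a hypothetical `sample_df` holding three rows copied from the `head()` output; the choice of integer and categorical dtypes is an assumption, not something the dataset description prescribes.

```python
import pandas as pd

# Hypothetical toy frame: discrete columns as they appear after loading (float64).
sample_df = pd.DataFrame({
    "cylinders": [8.0, 8.0, 8.0],
    "year": [70.0, 70.0, 70.0],
    "origin": [1.0, 1.0, 1.0],
})

discrete_cols = ["cylinders", "year", "origin"]

# Cast whole-number floats to integers. astype(int) fails on NaN,
# so this assumes the columns have no missing values.
sample_df = sample_df.astype({col: int for col in discrete_cols})

# origin is a code rather than a quantity, so a categorical dtype may suit it better.
sample_df["origin"] = sample_df["origin"].astype("category")

print(sample_df.dtypes)
```

The same cast could be applied to `mpg_df`, where `info()` shows no missing values in these three columns.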

## **Preprocessing**

### **Perform min-max scaling**

In [5]:
numerical_cols = ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']

In [6]:
numerical_df = mpg_df[numerical_cols]

In [7]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(numerical_df)



MinMaxScaler()

In [8]:
numerical_df_scaled = pd.DataFrame(scaler.transform(numerical_df), columns=numerical_cols)

In [9]:
numerical_df_scaled.head()

|   | mpg | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|
| 0 | 0.239362 | 0.617571 | 0.456522 | 0.53615 | 0.238095 |
| 1 | 0.159574 | 0.728682 | 0.646739 | 0.589736 | 0.208333 |
| 2 | 0.239362 | 0.645995 | 0.565217 | 0.51687 | 0.178571 |
| 3 | 0.18617 | 0.609819 | 0.565217 | 0.516019 | 0.238095 |
| 4 | 0.212766 | 0.604651 | 0.51087 | 0.520556 | 0.14881 |
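
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column through `(x - min) / (max - min)`. The sketch below reproduces that formula by hand with pandas on a hypothetical `toy_df` (the values are illustrative, not the full dataset), mainly to show that missing values stay missing after scaling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing mpg value.
toy_df = pd.DataFrame({
    "mpg": [18.0, 15.0, np.nan, 16.0],
    "weight": [3504.0, 3693.0, 3436.0, 3433.0],
})

# Column-wise min-max scaling. pandas min()/max() skip NaN by default,
# so the missing entry does not affect the range and remains NaN afterwards.
col_min = toy_df.min()
col_max = toy_df.max()
manual_scaled = (toy_df - col_min) / (col_max - col_min)

print(manual_scaled)
```

Each scaled column has minimum 0 and maximum 1, and the `NaN` in `mpg` is carried through unchanged, which is why imputation is still needed afterwards.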


Update the original DataFrame in place with the scaled values using the [`update`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html) method, which matches values on index and column labels.

In [10]:
mpg_df.update(numerical_df_scaled)

In [11]:
mpg_df.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.239362 | 8.0 | 0.617571 | 0.456522 | 0.53615 | 0.238095 | 70.0 | 1.0 | chevrolet chevelle malibu |
| 1 | 0.159574 | 8.0 | 0.728682 | 0.646739 | 0.589736 | 0.208333 | 70.0 | 1.0 | buick skylark 320 |
| 2 | 0.239362 | 8.0 | 0.645995 | 0.565217 | 0.51687 | 0.178571 | 70.0 | 1.0 | plymouth satellite |
| 3 | 0.18617 | 8.0 | 0.609819 | 0.565217 | 0.516019 | 0.238095 | 70.0 | 1.0 | amc rebel sst |
| 4 | 0.212766 | 8.0 | 0.604651 | 0.51087 | 0.520556 | 0.14881 | 70.0 | 1.0 | ford torino |
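
Two properties of `DataFrame.update` are easy to miss: it modifies the caller in place and returns `None`, and it skips `NaN` entries in the other frame. The sketch below illustrates both on hypothetical frames `original` and `scaled`; the names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: `scaled` covers only the mpg column and has one NaN.
original = pd.DataFrame({
    "mpg": [18.0, 15.0, 16.0],
    "name": ["car a", "car b", "car c"],
})
scaled = pd.DataFrame({"mpg": [1.0, np.nan, 0.0]})

# update() changes `original` in place and returns None.
# Values are matched on index and column labels; NaN entries in `scaled`
# are skipped, so row 1 keeps its old value.
result = original.update(scaled)

print(result)
print(original)
```

In the notebook, the missing `mpg` and `horsepower` entries are `NaN` in both the original and the scaled frame, so they remain `NaN` after the update and are handled in the imputation step.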


### **Perform imputation**

In [12]:
# Count missing values per column.
mpg_df.isnull().sum()

mpg             8
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
year            0
origin          0
name            0
dtype: int64

Only `mpg` and `horsepower` have missing values, and both are numerical, so we can safely fill them with their column means using the `fillna()` method.

In [13]:
# numeric_only=True excludes the string column `name` from the mean.
mpg_df.mean(numeric_only=True)



mpg              0.386026
cylinders        5.475369
displacement     0.327596
horsepower       0.321101
weight           0.387415
acceleration     0.447601
year            75.921182
origin           1.568966
dtype: float64

In [14]:
# Fill each missing value with the mean of its own column.
mpg_df = mpg_df.fillna(mpg_df.mean(numeric_only=True))
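
When `fillna()` receives a Series, it fills each column with the value whose label matches that column. The following is a minimal sketch on a hypothetical `toy_df` (values are illustrative). It passes `numeric_only=True` to `mean()`, since pandas 2.0 and later raise a `TypeError` when a string column such as `name` is included.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per numeric column.
toy_df = pd.DataFrame({
    "mpg": [0.2, np.nan, 0.4],
    "horsepower": [0.5, 0.7, np.nan],
    "name": ["car a", "car b", "car c"],
})

# Means of the numeric columns only; NaN entries are skipped by default.
column_means = toy_df.mean(numeric_only=True)

# Each NaN is replaced by the mean of its own column; `name` is left untouched.
filled = toy_df.fillna(column_means)

print(filled)
```

Mean imputation keeps every row but pulls the filled entries toward the center of each column, so it is worth checking how many values were filled (here, 8 in `mpg` and 6 in `horsepower` out of 406 rows).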

