# Objectives: 

This is Part I of the scikit-learn series. This notebook focuses on how to use scikit-learn to preprocess a dataset. You will learn: 

1. standardization 
2. normalization 
3. encoding categorical features
4. filling missing values

# Introduction

Scikit-learn (often referred to as sklearn) provides a wide array of statistical models and machine learning algorithms. Most of sklearn is written in Python, with performance-critical parts compiled through Cython. Much of its performance comes from building on NumPy and SciPy for high-performance linear algebra and array operations.

## Installing the Scikit-Learn library

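Scikit-learn itself depends on NumPy and SciPy (pip also pulls in a few small helper packages automatically). Matplotlib, IPython, SymPy, and pandas are not required by scikit-learn, but they are commonly installed alongside it for plotting, interactive work, symbolic math, and data handling.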

> 1. pip install numpy
> 2. pip install scipy
> 3. pip install matplotlib
> 4. pip install ipython
> 5. pip install sympy
> 6. pip install pandas

Now that the supporting libraries are installed, let us install scikit-learn.

> pip install scikit-learn
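A quick way to confirm that the installation worked is to import the packages and print their versions. This is a minimal sketch; the dictionary name `installed_versions` is only for illustration.

```python
import numpy as np
import sklearn

# Collect the version strings of the two packages used throughout this notebook
installed_versions = {
    "scikit-learn": sklearn.__version__,
    "numpy": np.__version__,
}

for package, version in installed_versions.items():
    print(package, version)
```

If either import fails, the corresponding `pip install` step did not complete in the environment the notebook is running in.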

# Application 1. Scikit-Learn for standardization


### Why do we need standardization? 

Distance-based models are machine learning algorithms that use distances to decide whether two data points are similar. If two points are close together, one can infer that their feature values are similar, and hence the points can be treated as similar. 

Standardization is an essential step for distance-based models, so that a feature measured on a large scale does not dominate the others.
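To make the point about one feature dominating more concrete, here is a minimal sketch. The `people` array (age in years, income in dollars) and the helper `euclidean` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is age in years, column 1 is income in dollars
people = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],   # very different age, similar income
    [26.0, 58_000.0],   # similar age, different income
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Distances on the raw data are driven almost entirely by the income column
raw_d01 = euclidean(people[0], people[1])
raw_d02 = euclidean(people[0], people[2])

# After standardizing each column, both features contribute on a similar scale
scaled = StandardScaler().fit_transform(people)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print("Raw distances:   ", raw_d01, raw_d02)
print("Scaled distances:", scaled_d01, scaled_d02)
```

On the raw data, the 35-year age gap between the first two people barely registers next to the income differences. After scaling, the two distances are of comparable size.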

### How to standardize our data? 
A data point $x$ is standardized as follows: 

$$ Z = \frac{x - \mu}{\sigma} $$ 
where $\mu$ is the mean of the distribution and $\sigma$ is the standard deviation of the distribution. 

After standardization, the data has mean $0$ and standard deviation $1$. The values themselves are not confined to a fixed range: any point more than one standard deviation away from the mean ends up outside $[-1, 1]$. 
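As a sanity check on the formula, the sketch below computes the standardized values by hand and compares them with `StandardScaler`. The five values are made up for illustration. Note the assumption stated in the comment: `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A small made-up column of values, shaped as (n_samples, 1)
values = np.array([32.5, 32.8, 33.1, 33.4, 33.9]).reshape(-1, 1)

mu = values.mean()
sigma = values.std()  # population standard deviation (ddof=0), as used by StandardScaler

# Apply Z = (x - mu) / sigma by hand
z_manual = (values - mu) / sigma

# Let scikit-learn do the same thing
z_sklearn = StandardScaler().fit_transform(values)

print("Manual and scikit-learn results agree:", np.allclose(z_manual, z_sklearn))
print("Mean after standardizing:", z_sklearn.mean())
print("Standard deviation after standardizing:", z_sklearn.std())
```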

### Example of standardization with Scikit-learn


Suppose the temperatures recorded in Bloomington, Illinois (in degrees Fahrenheit) in the month of January are: 

temperatures_list = [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                    32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]



In [1]:
# Import libraries
from sklearn.preprocessing import StandardScaler
import numpy as np

In [2]:
# List of temperatures recorded in Bloomington, Illinois (in degrees Fahrenheit) in the month of January
temperatures_list = [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                    32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]

# Convert the list to a NumPy array
temperatures_np = np.array(temperatures_list).reshape(-1,1)

In [3]:
# Standardize the vector
temperatures_std = StandardScaler().fit_transform(temperatures_np)

# Print the means
print("Mean Before Standardizing:",sum(temperatures_list)/len(temperatures_list))
print("Mean After Standardizing:",sum(temperatures_std.reshape(1,-1)[0])/len(temperatures_std))

# Output:
# Mean Before Standardizing: 32.896774193548396
# Mean After Standardizing: -2.6215588839535955e-15

Mean Before Standardizing: 32.896774193548396
Mean After Standardizing: -2.6215588839535955e-15


# Application 2. Scikit-Learn for normalization

### Why do we need normalization? 
Normalization is another feature scaling technique. It transforms the values of numeric attributes to a standard scale (0 to 1). Normalization is used in cases where the values do not follow a Gaussian distribution. 

#### Rule of thumb: standardize if the attribute can be modeled as a Gaussian distribution. If not, normalize.

Normalization is important because it keeps the model from favoring one attribute merely because its values are on a larger scale.

### How to normalize our data? 
A data point $x$ is normalized as follows:

$$ x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
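The formula can be checked by hand against `MinMaxScaler`. This is a minimal sketch on five made-up values; the variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A small made-up column of values, shaped as (n_samples, 1)
values = np.array([32.5, 32.8, 33.1, 33.4, 33.9]).reshape(-1, 1)

x_min = values.min()
x_max = values.max()

# Apply x_norm = (x - x_min) / (x_max - x_min) by hand
norm_manual = (values - x_min) / (x_max - x_min)

# Let scikit-learn do the same thing
norm_sklearn = MinMaxScaler().fit_transform(values)

print("Manual and scikit-learn results agree:", np.allclose(norm_manual, norm_sklearn))
print("Range after normalization:", norm_sklearn.min(), "to", norm_sklearn.max())
```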

### Example of normalization with Scikit-learn

In [4]:
#Import libraries
from sklearn.preprocessing import MinMaxScaler
import numpy as np

#List of temperatures recorded in Bloomington
temperatures_list = [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                    32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]

#Convert the list to a NumPy array
temperatures_np = np.array(temperatures_list).reshape(-1,1)



In [5]:
#Normalize the vector
temperatures_norm = MinMaxScaler().fit_transform(temperatures_np)

print("Minimum Value Before Normalization:",min(temperatures_np.reshape(1,-1)[0]))
print("Maximum Value Before Normalization:",max(temperatures_np.reshape(1,-1)[0]))
print("Minimum Value After Normalization:",min(temperatures_norm))
print("Maximum Value After Normalization:",max(temperatures_norm))

# Output:
# Minimum Value Before Normalization: 32.5
# Maximum Value Before Normalization: 33.9
# Minimum Value After Normalization: [0.]
# Maximum Value After Normalization: [1.]

Minimum Value Before Normalization: 32.5
Maximum Value Before Normalization: 33.9
Minimum Value After Normalization: [0.]
Maximum Value After Normalization: [1.]


# Application 3. Scikit-Learn when encoding categorical features

### Why do we need to encode categorical features? 
Almost every dataset has one or more categorical features. For example, consider a dataset containing the details of all the passengers of a certain airline. The categorical variables in the dataset could include the passenger's gender (male/female) and their seating choice (economy, business, first-class). All values have to be numerical for modelling, and hence these categorical features have to be encoded.

### Two types of encoding: Label Encoding and One-Hot Encoding

Consider a list of customers' seating choices: business, economy, first-class. 

- Label encoding replaces every "business" with 0, every "economy" with 1, and every "first-class" with 2. 

- One-hot encoding creates three binary features, one per seating choice. For each customer, the feature matching their choice is 1 and the other two are 0.


### Example of Label-Encoding

In [6]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in sorted (alphabetical) order: 
# business = 0, economy = 1, and first-class = 2 

seating = ['economy', 'business', 'first-class', 'business']

# Invoking an instance of Label Encoder
label_encoding = LabelEncoder()

# Fit the labels
encoded = label_encoding.fit(seating)

print(encoded.transform(seating))

# Output - [1 0 2 0]

[1 0 2 0]


Looking at the output, one can see that the feature has been encoded, but the numbers mean nothing on their own. Luckily, the .classes_ attribute helps us interpret what these labels are: the position of each class in .classes_ is its integer code.

In [7]:
#Iterate through the classes_ list and print them
seating_list = encoded.classes_

for seating_number in range(len(seating_list)):
    print(seating_number, seating_list[seating_number])

# Output
# 0 business 
# 1 economy 
# 2 first-class 

0 business
1 economy
2 first-class
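Besides reading `.classes_`, the encoder can also map codes back to the original labels with `inverse_transform`. A minimal sketch, reusing the same seating list (the names `encoder`, `codes`, and `decoded` are chosen for this illustration):

```python
from sklearn.preprocessing import LabelEncoder

seating = ['economy', 'business', 'first-class', 'business']

# Fit the encoder and turn the labels into integer codes
encoder = LabelEncoder().fit(seating)
codes = encoder.transform(seating)

# Map the integer codes back to the original string labels
decoded = encoder.inverse_transform(codes)

print(codes.tolist())
print(decoded.tolist())
```

This round trip is handy when a model predicts integer codes and the result has to be reported in terms of the original category names.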


### Example of one-hot-Encoding

If the seating feature is one-hot encoded, each value is represented by a vector of 1's and 0's instead of a single integer label.

For example, 

    business    = [1. 0. 0.] 
    economy     = [0. 1. 0.]
    first-class = [0. 0. 1.]

In [8]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

seating = ['economy', 'business', 'first-class', 'business']

# OneHotEncoder expects a 2D array, so convert the list and reshape it into a single column
seating_numpy_array = np.array(seating).reshape(-1,1)

# Create an instance of OneHotEncoder
one_hot_encoding = OneHotEncoder()

# Fit the encoder to the categories
encoded = one_hot_encoding.fit(seating_numpy_array)

# transform() returns a sparse matrix by default; toarray() makes it dense for printing
print(encoded.transform(seating_numpy_array).toarray())

# Output
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]

[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
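Two details are worth a short sketch: the column order of the one-hot output follows the fitted `categories_`, and a category that was not seen during fitting needs a policy. The example below assumes `handle_unknown='ignore'`, which encodes an unseen category as a row of zeros. The value `'premium'` is a hypothetical unseen category used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

seating = np.array(['economy', 'business', 'first-class', 'business']).reshape(-1, 1)

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore').fit(seating)

# The columns of the one-hot output follow this (sorted) order
print(encoder.categories_[0].tolist())

# 'premium' is a hypothetical category that was not present when fitting
new_rows = np.array(['business', 'premium']).reshape(-1, 1)
one_hot = encoder.transform(new_rows).toarray()
print(one_hot)
```

With the default setting (`handle_unknown='error'`), the same call would raise an error for the unseen category instead.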


# Application 4. Scikit-Learn filling missing values

### Why do we need scikit-learn to fill missing values? 

Collecting and cleaning the dataset takes up a large share of the time and resources in most projects, by some estimates almost 70%. When one deals with a real-life dataset, there are almost always missing values. Cleaning the dataset and handling missing data is important because many machine learning algorithms cannot handle a missing attribute in the data.

### Methods of dealing with missing values 

1. Remove every row of data that has a missing value. This means losing valuable-yet-incomplete data. 
2. Replace the missing values with values that can be inferred from known data. 
3. Replace the missing data with the mean of that column.

Missing values are encoded with NumPy's NaN (numpy.nan).
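The trade-off between dropping rows and filling them in can be sketched on a tiny made-up column. The `readings` array and the variable names are hypothetical; mean imputation stands in for the "replace" options.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up readings with two missing values, shaped as (n_samples, 1)
readings = np.array([33.2, np.nan, 32.9, 33.0, np.nan, 32.5]).reshape(-1, 1)

# Option 1: drop every row that contains a missing value (data is lost)
complete_rows = readings[~np.isnan(readings).any(axis=1)]

# Option 3: keep every row and fill the gaps with the column mean
mean_filled = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(readings)

print("Rows after dropping:", complete_rows.shape[0])
print("Rows after imputing:", mean_filled.shape[0])
```

Dropping keeps only observed values but shrinks the dataset. Imputing preserves the number of rows at the cost of introducing estimated values.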



### Example of using Scikit-learn to fill missing values 

Suppose the temperatures recorded in Bloomington, Illinois (in degrees Fahrenheit) in the month of February are:
    
temperatures = [33.2,32.8,32.9,33.0,"NaN",33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,"NaN",32.7,32.7,32.6,"NaN",32.6,32.9,32.8,
                32.8,32.5,32.6,"NaN",32.6,32.7,32.7,33.5, 33.7,33.9]
    
##### Let’s try to replace the missing temperatures with their mean.

In [9]:
import numpy as np
from sklearn.impute import SimpleImputer

#List of temperatures
temperatures = [33.2,32.8,32.9,33.0,"NaN",33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,"NaN",32.7,32.7,32.6,"NaN",32.6,32.9,32.8,
                32.8,32.5,32.6,"NaN",32.6,32.7,32.7,33.5, 33.7,33.9]

temperatures_cleaned = []

#Replace NaN's with np.nan
for temperature in temperatures:
    if temperature=="NaN":
        temperatures_cleaned.append(np.nan)
    else:
        temperatures_cleaned.append(temperature)
        
# convert the list into an array and reshape it 
temperatures_np = np.array(temperatures_cleaned).reshape(-1,1)

# Create an instance of the imputer
imputer_mean = SimpleImputer(missing_values=np.nan,strategy='mean')

# Fit the imputer and transform the array according to the chosen strategy
temperatures_np = imputer_mean.fit_transform(temperatures_np)

# reshape the output array 
temperatures_np.reshape(1,len(temperatures_np))

array([[33.2       , 32.8       , 32.9       , 33.        , 32.91111111,
        33.2       , 33.4       , 33.1       , 32.6       , 32.5       ,
        32.5       , 33.1       , 33.        , 32.91111111, 32.7       ,
        32.7       , 32.6       , 32.91111111, 32.6       , 32.9       ,
        32.8       , 32.8       , 32.5       , 32.6       , 32.91111111,
        32.6       , 32.7       , 32.7       , 33.5       , 33.7       ,
        33.9       ]])

### SimpleImputer provides four options for strategy 

- mean, median, most_frequent, and constant. 

Since mean was the chosen strategy, the NaN values were replaced with the mean of the observed temperatures (32.91111111).
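After fitting, `SimpleImputer` stores the fill value it computed in its `statistics_` attribute, which makes it easy to compare strategies side by side. The sketch below uses a small made-up column; the names `readings` and `fill_values` are for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up readings with two missing values, shaped as (n_samples, 1)
readings = np.array([33.2, np.nan, 32.9, 33.0, np.nan, 32.5, 32.5]).reshape(-1, 1)

fill_values = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy).fit(readings)
    # statistics_ holds one fill value per column; this data has a single column
    fill_values[strategy] = float(imputer.statistics_[0])

for strategy, value in fill_values.items():
    print(strategy, value)
```

The median and the most frequent value are less sensitive to a few extreme readings than the mean, which is one reason to prefer them on skewed data.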

##### Let's try to replace the missing temperatures with the most frequent value (most_frequent).

In [11]:
#List of temperatures
temperatures = [33.2,32.8,32.9,33.0,"NaN",33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,"NaN",32.7,32.7,32.6,"NaN",32.6,32.9,32.8,
                32.8,32.5,32.6,"NaN",32.6,32.7,32.7,33.5, 33.7,33.9]

temperatures_cleaned = []

#Replace NaN's with np.nan
for temperature in temperatures:
    if temperature=="NaN":
        temperatures_cleaned.append(np.nan)
    else:
        temperatures_cleaned.append(temperature)
        
# convert the list into an array and reshape it 
temperatures_np = np.array(temperatures_cleaned).reshape(-1,1)

# Create an instance of the imputer
imputer_most_frequent = SimpleImputer(missing_values=np.nan,strategy='most_frequent')

# Fit the imputer and transform the array according to the chosen strategy
temperatures_np = imputer_most_frequent.fit_transform(temperatures_np)

# reshape the output array 
temperatures_np.reshape(1,len(temperatures_np))

array([[33.2, 32.8, 32.9, 33. , 32.6, 33.2, 33.4, 33.1, 32.6, 32.5, 32.5,
        33.1, 33. , 32.6, 32.7, 32.7, 32.6, 32.6, 32.6, 32.9, 32.8, 32.8,
        32.5, 32.6, 32.6, 32.6, 32.7, 32.7, 33.5, 33.7, 33.9]])

##### Let’s try to replace the missing temperatures with a constant 32.9 

In [12]:
#List of temperatures
temperatures = [33.2,32.8,32.9,33.0,"NaN",33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,"NaN",32.7,32.7,32.6,"NaN",32.6,32.9,32.8,
                32.8,32.5,32.6,"NaN",32.6,32.7,32.7,33.5, 33.7,33.9]

temperatures_cleaned = []

#Replace NaN's with np.nan
for temperature in temperatures:
    if temperature=="NaN":
        temperatures_cleaned.append(np.nan)
    else:
        temperatures_cleaned.append(temperature)
        
# convert the list into an array and reshape it 
temperatures_np = np.array(temperatures_cleaned).reshape(-1,1)

# Create an instance of the imputer
imputer_constant = SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=32.9)

# Fit the imputer and transform the array according to the chosen strategy
temperatures_np = imputer_constant.fit_transform(temperatures_np)

# reshape the output array 
temperatures_np.reshape(1,len(temperatures_np))

array([[33.2, 32.8, 32.9, 33. , 32.9, 33.2, 33.4, 33.1, 32.6, 32.5, 32.5,
        33.1, 33. , 32.9, 32.7, 32.7, 32.6, 32.9, 32.6, 32.9, 32.8, 32.8,
        32.5, 32.6, 32.9, 32.6, 32.7, 32.7, 33.5, 33.7, 33.9]])

# Summary: 

After working through this notebook, you should be able to use scikit-learn to do the following: 

1. standardization 
2. normalization 
3. encoding categorical features
4. filling missing values

# Next: scikit-learn part 2 for machine learning 