### Python Virtual Environment

From the command line, we can run the following command to create a virtual environment. This lets us install the library versions a project needs without affecting the system-wide installation.

python -m venv path

Once the command finishes, the virtual environment exists at that path. On Windows, its `Scripts` folder contains the `activate` and `deactivate` batch files; running `activate` or `deactivate` from that folder enables or disables the environment, respectively.
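The same environment can also be created from Python itself through the standard library's `venv` module. The sketch below is illustrative only: the `create_env` helper and the temporary `demo_env` folder are hypothetical names, not part of the original steps.

```python
import sys
import tempfile
import venv
from pathlib import Path

def create_env(env_dir):
    """Create a virtual environment at env_dir and return its scripts folder."""
    # with_pip=False keeps this sketch fast; `python -m venv` installs pip by default.
    venv.create(env_dir, with_pip=False)
    # Windows places the activation scripts in "Scripts", other platforms in "bin".
    scripts_name = "Scripts" if sys.platform == "win32" else "bin"
    return Path(env_dir) / scripts_name

# Build the environment in a throwaway directory (hypothetical location).
demo_root = tempfile.mkdtemp()
scripts_dir = create_env(Path(demo_root) / "demo_env")

print(scripts_dir)
print((scripts_dir.parent / "pyvenv.cfg").exists())
```

The `pyvenv.cfg` file at the root of the folder is what marks it as a virtual environment.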

With the environment active, libraries are installed into the environment itself rather than system-wide.

pip install numpy

pip install pandas

pip install scikit-learn

#### Load Required Libraries

In [122]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#### Load Dataset (CSV format file)

In [123]:
df = pd.read_csv('Dataset/employee_data.csv')
df

Unnamed: 0,City,Age,Salary,Eligible for bonus
0,Mumbai,27.0,51000.0,Yes
1,NewYork,27.0,48000.0,Yes
2,Mumbai,30.0,52000.0,No
3,NewYork,,66000.0,No
4,Tokyo,48.0,,Yes
5,Tokyo,,51000.0,No
6,Singapore,33.0,69000.0,No
7,NewYork,40.0,79000.0,Yes
8,Mumbai,38.0,,Yes
9,Singapore,35.0,38000.0,No


In [124]:
df.describe()

Unnamed: 0,Age,Salary
count,14.0,15.0
mean,36.5,63000.0
std,12.296028,19574.03528
min,16.0,38000.0
25%,30.25,50000.0
50%,35.0,56000.0
75%,39.5,75500.0
max,69.0,110000.0


#### Create Dependent and Independent Variable Vectors

In [125]:
# Independent variables (every column except the last)
x = df.iloc[:,:-1].values
x

array([['Mumbai', 27.0, 51000.0],
       ['NewYork', 27.0, 48000.0],
       ['Mumbai', 30.0, 52000.0],
       ['NewYork', nan, 66000.0],
       ['Tokyo', 48.0, nan],
       ['Tokyo', nan, 51000.0],
       ['Singapore', 33.0, 69000.0],
       ['NewYork', 40.0, 79000.0],
       ['Mumbai', 38.0, nan],
       ['Singapore', 35.0, 38000.0],
       ['Tokyo', nan, 56000.0],
       ['Singapore', 35.0, 72000.0],
       ['NewYork', 45.0, 79000.0],
       ['Mumbai', 31.0, 85000.0],
       ['Singapore', 37.0, 49000.0],
       ['Mumbai', 69.0, 110000.0],
       ['Tokyo', 16.0, 40000.0]], dtype=object)

In [126]:
# Dependent variable (the last column)
y = df.iloc[:,-1].values
y

array(['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No',
       'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes'], dtype=object)
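To make the slicing concrete, here is a minimal sketch on a tiny stand-in frame. The `demo` frame and the names `features` and `target` are made up for illustration; only the column layout mirrors the data above.

```python
import pandas as pd

# Tiny stand-in frame with the same column layout as the notebook's data.
demo = pd.DataFrame({
    "City": ["Mumbai", "Tokyo", "NewYork"],
    "Age": [27.0, 48.0, None],
    "Salary": [51000.0, None, 66000.0],
    "Eligible for bonus": ["Yes", "Yes", "No"],
})

features = demo.iloc[:, :-1].values  # all rows, every column except the last
target = demo.iloc[:, -1].values     # all rows, only the last column

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
```

Selecting by position with `iloc` keeps the code independent of the column names, at the cost of assuming the label is always the last column.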

#### Handling Missing Values

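- Some algorithms, such as `K Nearest Neighbor` and `Naive Bayes`, are often described as `not requiring missing values to be handled`. Whether that holds depends on the implementation, so it is safer to check the library's documentation before skipping this step.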

##### Count the number of missing values in each column

In [127]:
df.isnull().sum()

City                  0
Age                   3
Salary                2
Eligible for bonus    0
dtype: int64

##### Drop missing value records

In [128]:
df.dropna(inplace=True)
# inplace=True modifies df itself instead of returning a new DataFrame; rows without missing values are kept unchanged.
# x and y were extracted earlier, so they still contain every row.
df

Unnamed: 0,City,Age,Salary,Eligible for bonus
0,Mumbai,27.0,51000.0,Yes
1,NewYork,27.0,48000.0,Yes
2,Mumbai,30.0,52000.0,No
6,Singapore,33.0,69000.0,No
7,NewYork,40.0,79000.0,Yes
9,Singapore,35.0,38000.0,No
11,Singapore,35.0,72000.0,No
12,NewYork,45.0,79000.0,Yes
13,Mumbai,31.0,85000.0,Yes
14,Singapore,37.0,49000.0,No
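As a small illustration of counting and dropping, the sketch below uses a made-up four-row frame (`demo` is a hypothetical name). It calls `dropna()` without `inplace=True` to show that this form returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with one missing value per column.
demo = pd.DataFrame({
    "Age": [27.0, np.nan, 48.0, 33.0],
    "Salary": [51000.0, 66000.0, np.nan, 69000.0],
})

missing_per_column = demo.isnull().sum()  # number of NaN entries in each column
cleaned = demo.dropna()                   # new frame; demo keeps all four rows

print(missing_per_column.to_dict())  # {'Age': 1, 'Salary': 1}
print(list(cleaned.index))           # [0, 3]
print(len(demo))                     # 4
```

Note that the surviving rows keep their original index labels, which is why the dropped-row table above skips some index values.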


##### Replace missing values

In [129]:
# SimpleImputer (from sklearn.impute) locates missing values and fills them according to the chosen strategy, here the column median.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(x[:,1:3])  # columns 1 and 2 hold Age and Salary
x[:,1:3] = imputer.transform(x[:,1:3])

Using the median instead of the mean to fill the missing values keeps the filled-in values close to where most of the data lies, because outliers can shift the mean considerably but barely affect the median.
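The difference between the two statistics can be checked directly on the Age column. This is a sketch using plain NumPy rather than the imputer; the variable names are chosen here for illustration.

```python
import numpy as np

# Age column from the notebook's data, including its three missing entries.
ages = np.array([27, 27, 30, np.nan, 48, np.nan, 33, 40, 38, 35,
                 np.nan, 35, 45, 31, 37, 69, 16], dtype=float)

mean_age = np.nanmean(ages)      # ignores NaN; sensitive to extreme values such as 69
median_age = np.nanmedian(ages)  # ignores NaN; middle of the sorted values

# Fill the gaps with the median, which is what the 'median' strategy does per column.
filled = np.where(np.isnan(ages), median_age, ages)

print(mean_age)    # 36.5
print(median_age)  # 35.0
```

Both numbers agree with the `mean` and `50%` rows reported by `df.describe()` earlier.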

In [130]:
x

array([['Mumbai', 27.0, 51000.0],
       ['NewYork', 27.0, 48000.0],
       ['Mumbai', 30.0, 52000.0],
       ['NewYork', 35.0, 66000.0],
       ['Tokyo', 48.0, 56000.0],
       ['Tokyo', 35.0, 51000.0],
       ['Singapore', 33.0, 69000.0],
       ['NewYork', 40.0, 79000.0],
       ['Mumbai', 38.0, 56000.0],
       ['Singapore', 35.0, 38000.0],
       ['Tokyo', 35.0, 56000.0],
       ['Singapore', 35.0, 72000.0],
       ['NewYork', 45.0, 79000.0],
       ['Mumbai', 31.0, 85000.0],
       ['Singapore', 37.0, 49000.0],
       ['Mumbai', 69.0, 110000.0],
       ['Tokyo', 16.0, 40000.0]], dtype=object)

#### Data Encoding: Handle/encode categorical data

##### OneHot Encoding

In [131]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
x

array([[1.0, 0.0, 0.0, 0.0, 27.0, 51000.0],
       [0.0, 1.0, 0.0, 0.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 0.0, 30.0, 52000.0],
       [0.0, 1.0, 0.0, 0.0, 35.0, 66000.0],
       [0.0, 0.0, 0.0, 1.0, 48.0, 56000.0],
       [0.0, 0.0, 0.0, 1.0, 35.0, 51000.0],
       [0.0, 0.0, 1.0, 0.0, 33.0, 69000.0],
       [0.0, 1.0, 0.0, 0.0, 40.0, 79000.0],
       [1.0, 0.0, 0.0, 0.0, 38.0, 56000.0],
       [0.0, 0.0, 1.0, 0.0, 35.0, 38000.0],
       [0.0, 0.0, 0.0, 1.0, 35.0, 56000.0],
       [0.0, 0.0, 1.0, 0.0, 35.0, 72000.0],
       [0.0, 1.0, 0.0, 0.0, 45.0, 79000.0],
       [1.0, 0.0, 0.0, 0.0, 31.0, 85000.0],
       [0.0, 0.0, 1.0, 0.0, 37.0, 49000.0],
       [1.0, 0.0, 0.0, 0.0, 69.0, 110000.0],
       [0.0, 0.0, 0.0, 1.0, 16.0, 40000.0]], dtype=object)
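To see which output column corresponds to which city, the encoder can be inspected on its own. The following is a minimal sketch on a standalone `cities` array (a name introduced here); it assumes the encoder's default behaviour of sorting categories alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of city names, one row per sample.
cities = np.array([["Mumbai"], ["NewYork"], ["Tokyo"], ["Singapore"]], dtype=object)

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense for printing.
encoded = encoder.fit_transform(cities).toarray()

# categories_ lists the learned categories in output-column order.
print(list(encoder.categories_[0]))  # ['Mumbai', 'NewYork', 'Singapore', 'Tokyo']
print(encoded)
```

This ordering explains the array above: Singapore sets the third column and Tokyo the fourth.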

##### Label Encoding

In [132]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = np.array(le.fit_transform(y))
y

array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1])
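The mapping between labels and integers can be read back from the encoder. A short sketch, using a five-element `labels` list made up for illustration:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the labels in sorted order; a label's position is its integer code.
print(list(encoder.classes_))                     # ['No', 'Yes']
print(codes.tolist())                             # [1, 1, 0, 0, 1]
print(encoder.inverse_transform(codes).tolist())  # ['Yes', 'Yes', 'No', 'No', 'Yes']
```

`inverse_transform` is useful later for turning a model's integer predictions back into the original labels.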

#### Splitting data into train and test data

In [133]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25, random_state=1)

In [134]:
x_train

array([[0.0, 0.0, 0.0, 1.0, 35.0, 56000.0],
       [0.0, 0.0, 0.0, 1.0, 48.0, 56000.0],
       [0.0, 1.0, 0.0, 0.0, 27.0, 48000.0],
       [0.0, 0.0, 1.0, 0.0, 37.0, 49000.0],
       [1.0, 0.0, 0.0, 0.0, 27.0, 51000.0],
       [0.0, 0.0, 0.0, 1.0, 16.0, 40000.0],
       [1.0, 0.0, 0.0, 0.0, 69.0, 110000.0],
       [0.0, 0.0, 1.0, 0.0, 35.0, 38000.0],
       [1.0, 0.0, 0.0, 0.0, 38.0, 56000.0],
       [0.0, 1.0, 0.0, 0.0, 45.0, 79000.0],
       [0.0, 0.0, 1.0, 0.0, 35.0, 72000.0],
       [0.0, 0.0, 0.0, 1.0, 35.0, 51000.0]], dtype=object)

In [135]:
x_test

array([[0.0, 1.0, 0.0, 0.0, 35.0, 66000.0],
       [1.0, 0.0, 0.0, 0.0, 31.0, 85000.0],
       [0.0, 1.0, 0.0, 0.0, 40.0, 79000.0],
       [1.0, 0.0, 0.0, 0.0, 30.0, 52000.0],
       [0.0, 0.0, 1.0, 0.0, 33.0, 69000.0]], dtype=object)

In [136]:
y_train

array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0])

In [137]:
y_test

array([0, 1, 1, 0, 0])
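The 12/5 row counts above follow from `test_size=0.25` applied to 17 rows, with the test share rounded up. The sketch below reproduces the sizes on placeholder arrays (`features` and `target` are stand-ins, not the notebook's data) and shows that a fixed `random_state` gives a repeatable split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(34).reshape(17, 2)  # 17 placeholder rows, 2 columns
target = np.arange(17) % 2               # placeholder binary labels

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.25, random_state=1
)
print(f_train.shape, f_test.shape)  # (12, 2) (5, 2)

# Repeating the call with the same random_state selects the same rows.
f_train2, f_test2, t_train2, t_test2 = train_test_split(
    features, target, test_size=0.25, random_state=1
)
print(np.array_equal(f_test, f_test2))  # True
```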

#### Feature Scaling - Standardization & Normalization

- If one column holds small values, say in the hundreds, and another holds values in the millions, the larger-valued column can cost more to compute with and can dominate the smaller one while training a machine learning model. That is why we scale all features to a comparable range, so that each has a similar influence during training.

- If all values are scaled to between 0 and 1, computing with them becomes faster and easier.

![Screenshot%202022-12-30%20at%209.59.46%20PM.png](attachment:Screenshot%202022-12-30%20at%209.59.46%20PM.png)

- In Standardization each feature is rescaled to zero mean and unit standard deviation, so most values typically land between about -3 and +3. `Standardization is a reasonable default for most data`.
- In Normalization (min-max scaling) each feature is rescaled to the range 0 to 1, so it is never negative. `Because it depends on the minimum and maximum, Normalization is sensitive to outliers and works best when the values are well distributed`.
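Written out, the two rescalings are z = (x - mean) / std for Standardization and x' = (x - min) / (max - min) for Normalization. The sketch below implements both with NumPy; the function names and the sample `salaries` list are chosen here for illustration and are not taken from scikit-learn.

```python
import numpy as np

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def min_max_normalize(values):
    """Map the smallest value to 0 and the largest to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

salaries = [38000.0, 51000.0, 56000.0, 79000.0, 110000.0]

standardized = standardize(salaries)
normalized = min_max_normalize(salaries)

print(standardized.round(3))  # centred on 0, can be negative
print(normalized.round(3))    # bounded to [0, 1]
```

The standardized values are not bounded; the -3 to +3 range is only where most of them tend to fall.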

In [138]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
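x_train[:,4:] = scaler.fit_transform(x_train[:,4:])  # columns 4 onward hold Age and Salary
# Note: fit_transform re-estimates the mean and standard deviation from the test set.
# The usual practice is scaler.transform(x_test[:,4:]), which reuses the training statistics.
x_test[:,4:] = scaler.fit_transform(x_test[:,4:])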

In [139]:
x_train

array([[0.0, 0.0, 0.0, 1.0, -0.1804214757401082, -0.14885388509396494],
       [0.0, 0.0, 0.0, 1.0, 0.8620137174249614, -0.14885388509396494],
       [0.0, 1.0, 0.0, 0.0, -0.8219200561493817, -0.5691472077122185],
       [0.0, 0.0, 1.0, 0.0, -0.020046830637789798, -0.5166105423849368],
       [1.0, 0.0, 0.0, 0.0, -0.8219200561493817, -0.4115372117303734],
       [0.0, 0.0, 0.0, 1.0, -1.7039806042121328, -0.989440530330472],
       [1.0, 0.0, 0.0, 0.0, 2.5459474909993043, 2.6881260425792464],
       [0.0, 0.0, 1.0, 0.0, -0.1804214757401082, -1.0945138609850353],
       [1.0, 0.0, 0.0, 0.0, 0.0601404919133694, -0.14885388509396494],
       [0.0, 1.0, 0.0, 0.0, 0.6214517497714838, 1.059489417433514],
       [0.0, 0.0, 1.0, 0.0, -0.1804214757401082, 0.6917327601425421],
       [0.0, 0.0, 0.0, 1.0, -0.1804214757401082, -0.4115372117303734]],
      dtype=object)

In [140]:
x_test

array([[0.0, 1.0, 0.0, 0.0, 0.338599588789861, -0.3692744729379982],
       [1.0, 0.0, 0.0, 0.0, -0.7900657071763397, 1.3012529046386603],
       [0.0, 1.0, 0.0, 0.0, 1.7494312087476118, 0.773717943298663],
       [1.0, 0.0, 0.0, 0.0, -1.07223203116789, -1.6001893827313256],
       [0.0, 0.0, 1.0, 0.0, -0.22573305919323933, -0.10550699226799949]],
      dtype=object)
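The scaling cell above calls `fit_transform` on the test rows as well, so the test set is scaled with its own mean and standard deviation. A common alternative is to fit on the training rows only and reuse those statistics for the test rows. The following is a sketch of that pattern on small made-up arrays (`train` and `test` are stand-ins for the Age and Salary columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in Age and Salary columns (illustrative values only).
train = np.array([[27.0, 51000.0], [48.0, 56000.0], [35.0, 38000.0], [69.0, 110000.0]])
test = np.array([[30.0, 52000.0], [40.0, 79000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns the mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Manual check: the test rows are scaled with the training statistics.
expected = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, expected))  # True

# For comparison, refitting on the two test rows maps them to -1 and +1 in each column.
refit = StandardScaler().fit_transform(test)
print(refit)
```

With the refit, the scaled test values no longer reflect how the test rows compare with the training data, which is what a model trained on `train_scaled` expects.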

#### Outlier Detection and Removal