## CSE 422 Introduction to Data Preprocessing
---







### What are the advantages of preprocessing data before applying a machine learning algorithm?

"The biggest advantage of pre-processing in ML is to improve the **generalizability** of your model. Data for any ML application is collected through some ‘sensors’. These sensors can be physical devices, instruments, software programs such as web crawlers, manual surveys, etc. Due to hardware malfunctions, software glitches, instrument failures, and human errors, noise and erroneous information may creep in that can severely affect the performance of your model. Apart from **noise**, there is often **redundant information** that needs to be removed. For example, when predicting whether it will rain tomorrow, the age of a person is irrelevant. In text processing, there are several stop words that may be redundant for the analysis. Lastly, there may be several **outliers** in your data, arising from the way the data is collected, that may need to be removed to improve the performance of the classifiers."
                                    
                                            -Shehroz Khan, ML Researcher, Postdoc @U of Toronto


Some Data Preprocessing Techniques:

* Deleting duplicate and null values
* Imputation for missing values
* Handling Categorical Features
* Feature Normalization/Scaling
* Feature Engineering
* Feature Selection
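
As a rough illustration of a few of the techniques listed above, the sketch below applies them to a tiny made-up table. The column names (`city`, `temp_c`, `humidity`) and all values are hypothetical and are not part of the dataset used later in this notebook; mean imputation and min-max scaling are just one possible choice for each step.

```python
import pandas as pd

# Hypothetical toy data: one duplicate row and one missing value
toy = pd.DataFrame({
    "city": ["Dhaka", "Dhaka", "Khulna", "Sylhet"],
    "temp_c": [30.0, 30.0, None, 26.0],
    "humidity": [80, 80, 70, 90],
})

# 1. Delete duplicate rows
toy = toy.drop_duplicates().reset_index(drop=True)

# 2. Impute the missing temperature with the column mean
toy["temp_c"] = toy["temp_c"].fillna(toy["temp_c"].mean())

# 3. One-hot encode the categorical feature
toy = pd.get_dummies(toy, columns=["city"])

# 4. Min-max scale a numeric feature to the range [0, 1]
hum = toy["humidity"]
toy["humidity"] = (hum - hum.min()) / (hum.max() - hum.min())

print(toy)
```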

In [None]:
#importing necessary libraries
import pandas as pd
import numpy as np


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Removing null values / Handling missing data




In [None]:
volunteer = pd.read_csv('/content/drive/MyDrive/AI Paper/imdb_top_1000.csv')
volunteer.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444


In [None]:
volunteer.shape

(1000, 16)

In [None]:
volunteer.isnull().sum()

Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

In [None]:
# Check how many values are missing in the Certificate column
print("Number of rows with null values in Certificate column: ", volunteer['Certificate'].isnull().sum())

# Subset the volunteer dataset

Certificate = volunteer[volunteer['Certificate'].notnull()]

# Print out the shape of the subset
print("Shape after removing null values: ", Certificate.shape)

Number of rows with null values in Certificate column:  101
Shape after removing null values:  (899, 16)


In [None]:
print("Shape of dataframe before dropping:", volunteer.shape)
# Drop every row that has a null in any of the three columns with missing values
volunteer = volunteer.dropna(axis = 0, subset = ['Certificate', 'Meta_score', 'Gross'])
print("Shape after dropping:", volunteer.shape)

Shape of dataframe before dropping: (1000, 16)
Shape after dropping: (714, 16)
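
Dropping rows this way removes 286 of the 1000 movies. Imputation, listed among the techniques at the top of this notebook, keeps those rows instead. The following is only a sketch on a hypothetical miniature frame that reuses two column names from the dataset; the choice of the median for the numeric column and of an `"Unknown"` placeholder for the categorical column is an assumption made for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that reuses two column names from the dataset
movies = pd.DataFrame({
    "Certificate": ["A", None, "UA", "A"],
    "Meta_score": [80.0, np.nan, 84.0, 90.0],
})

# Fraction of missing values per column, to judge how costly dropping would be
print(movies.isnull().mean())

# Numeric column: fill gaps with the median of the observed values
imputed = movies.copy()
imputed["Meta_score"] = imputed["Meta_score"].fillna(imputed["Meta_score"].median())

# Categorical column: fill gaps with an explicit placeholder category
imputed["Certificate"] = imputed["Certificate"].fillna("Unknown")

print(imputed)
```

Unlike `dropna`, this keeps all rows, at the cost of introducing values that were never observed.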


In [None]:
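# Overview text of a movie whose rating bin is predicted later in the notebook.
# Note: the name `input` shadows Python's built-in input() function.
input= ["A young, easy-going gunman worships and competes with a famed gunfighter, insisting that he must face down a gang of 150 outlaws before he can retire."]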


In [None]:
volunteer.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444


### Dropping columns

In [None]:
# Keep only the three columns needed for the rest of the notebook
volunteer = volunteer[["Series_Title", "IMDB_Rating", "Overview"]].copy()
volunteer.head(3)

Unnamed: 0,Series_Title,IMDB_Rating,Overview
0,The Shawshank Redemption,9.3,Two imprisoned men bond over a number of years...
1,The Godfather,9.2,An organized crime dynasty's aging patriarch t...
2,The Dark Knight,9.0,When the menace known as the Joker wreaks havo...


### Creating indicator columns to categorize rating

In [None]:
rating_list= []

In [None]:
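# Create one 0/1 indicator column per 0.25-wide rating bin: [7.5, 7.75), ..., [9.0, 9.25).
# Ratings of 9.25 or higher fall outside every bin and get 0 in all columns.
for i in range(7):
  rating=7.5+i*0.25
  print(rating)
  volunteer[str(rating)+'-'+str(rating+0.25)]= np.where((volunteer["IMDB_Rating"]>=rating) & (volunteer["IMDB_Rating"]<rating+0.25), 1, 0)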


7.5
7.75
8.0
8.25
8.5
8.75
9.0


In [None]:
volunteer.head(10)

Unnamed: 0,Series_Title,IMDB_Rating,Overview,7.5-7.75,7.75-8.0,8.0-8.25,8.25-8.5,8.5-8.75,8.75-9.0,9.0-9.25
0,The Shawshank Redemption,9.3,Two imprisoned men bond over a number of years...,0,0,0,0,0,0,0
1,The Godfather,9.2,An organized crime dynasty's aging patriarch t...,0,0,0,0,0,0,1
2,The Dark Knight,9.0,When the menace known as the Joker wreaks havo...,0,0,0,0,0,0,1
3,The Godfather: Part II,9.0,The early life and career of Vito Corleone in ...,0,0,0,0,0,0,1
4,12 Angry Men,9.0,A jury holdout attempts to prevent a miscarria...,0,0,0,0,0,0,1
5,The Lord of the Rings: The Return of the King,8.9,Gandalf and Aragorn lead the World of Men agai...,0,0,0,0,0,1,0
6,Pulp Fiction,8.9,"The lives of two mob hitmen, a boxer, a gangst...",0,0,0,0,0,1,0
7,Schindler's List,8.9,"In German-occupied Poland during World War II,...",0,0,0,0,0,1,0
8,Inception,8.8,A thief who steals corporate secrets through t...,0,0,0,0,0,1,0
9,Fight Club,8.8,An insomniac office worker and a devil-may-car...,0,0,0,0,0,1,0
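
The indicator columns above are built by hand, one bin per loop iteration. A similar table can be sketched with `pd.cut` and `pd.get_dummies`. The ratings below are a small sample copied from the table above, and the variable names (`ratings`, `edges`, `indicators`) are illustrative only.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([9.3, 9.2, 9.0, 8.9, 8.8], name="IMDB_Rating")

# Edges 7.5, 7.75, ..., 9.25 (eight edges give seven bins)
edges = 7.5 + 0.25 * np.arange(8)

# right=False makes each bin closed on the left: [low, high)
bins = pd.cut(ratings, bins=edges, right=False)

# One indicator column per bin; values outside every bin get all zeros
indicators = pd.get_dummies(bins).astype(int)
indicators.columns = [f"{iv.left}-{iv.right}" for iv in bins.cat.categories]

print(indicators)
```

As in the table above, the 9.3 rating falls outside every bin, so its row contains only zeros.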


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.metrics import f1_score

# Repeat the whole experiment 10 times; each run uses a fresh random split
for run in range(10):
  # Train one binary classifier per rating bin
  for i in range(7):
    rating=7.5+i*0.25
    x_train, x_test, y_train, y_test= train_test_split(volunteer.Overview, volunteer[str(rating)+ "-"+ str(rating+0.25)])
    # Bag-of-words features: fit the vocabulary on the training text only
    v= CountVectorizer()
    x_train_vec= v.fit_transform(x_train)
    x_test_vec= v.transform(x_test)
    clf_svm= svm.SVC(kernel= "linear")
    clf_svm.fit(x_train_vec,y_train)
    # Per-class F1 on the held-out split (class 0 first, then class 1)
    print(f1_score(y_test, clf_svm.predict(x_test_vec), average= None))
    # Classify the new overview; record the bin's lower edge if it is predicted positive
    input_vec= v.transform(input)
    input_pred= clf_svm.predict(input_vec)
    print(input_pred)
    if input_pred[0]== 1:
      rating_list.append(rating)

[0.71042471 0.24242424]
[0]
[0.7804878  0.11267606]
[1]
[0.75524476 0.02777778]
[0]
[0.95321637 0.        ]
[0]
[0.97109827 0.16666667]
[0]
[0.99438202 0.        ]
[0]
[0.99719888 0.        ]
[0]
[0.6848249 0.1980198]
[0]
[0.78745645 0.14084507]
[1]
[0.79298246 0.19178082]
[0]
[0.95626822 0.        ]
[0]
[0.97714286 0.        ]
[0]
[0.99719888 0.        ]
[0]
[0.99438202 0.        ]
[0]
[0.71212121 0.19148936]
[0]
[0.76760563 0.10810811]
[1]
[0.76056338 0.08108108]
[0]
[0.95930233 0.        ]
[0]
[0.98295455 0.        ]
[0]
[0.99438202 0.        ]
[0]
[0.99719888 0.        ]
[0]
[0.70498084 0.20618557]
[0]
[0.78291815 0.20779221]
[1]
[0.74551971 0.10126582]
[0]
[0.95930233 0.        ]
[0]
[0.97126437 0.        ]
[0]
[0.99719888 0.        ]
[0]
[0.99719888 0.        ]
[0]
[0.74906367 0.26373626]
[1]
[0.78169014 0.16216216]
[1]
[0.69888476 0.08988764]
[0]
[0.93452381 0.        ]
[0]
[0.97126437 0.        ]
[0]
[0.99438202 0.        ]
[0]
[1.]
[0]
[0.70188679 0.15053763]
[0]
[0.74368231 0
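
Several runs above report an F1 of 0 for the positive class, and one prints a single score (`[1.]`), which is consistent with very few or no positive examples landing in a given split. One possible mitigation, sketched here on a hypothetical toy corpus rather than on the movie data, is to stratify the split and weight the classes. The texts, labels, `stratify`, `random_state`, and `class_weight="balanced"` settings are all assumptions added for illustration and are not part of the experiment above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = "in the bin", 0 = "not in the bin"
texts = [
    "a gunman faces a gang of outlaws", "an outlaw gunfighter rides again",
    "outlaws and a sheriff duel at noon", "a famed gunfighter must retire",
    "a family moves to a quiet town", "two friends open a bakery",
    "a student prepares for exams", "a chef travels across the country",
    "a teacher inspires her class", "a couple renovates an old house",
    "a musician records an album", "a detective solves a quiet case",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# stratify keeps the class ratio the same in both splits; random_state fixes the split
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training text only, then the SVM
model = make_pipeline(
    CountVectorizer(),
    SVC(kernel="linear", class_weight="balanced"),
)
model.fit(x_train, y_train)

pred = model.predict(x_test)
# labels=[0, 1] guarantees two scores; zero_division=0 avoids undefined-metric warnings
scores = f1_score(y_test, pred, labels=[0, 1], average=None, zero_division=0)
print(scores)
```

With stratification, every test split contains at least one positive example, so the positive-class F1 is always defined.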

In [None]:
print(rating_list)
# Average the lower edges of all bins predicted for the input overview
if rating_list:
  print(sum(rating_list)/len(rating_list))
else:
  print("No rating bin was predicted for the input")

[7.75, 7.75, 7.75, 7.75, 7.5, 7.75, 7.75, 7.75, 7.75, 7.75]
7.725
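
The mean is one way to summarize the predicted bins; the most frequently predicted bin is another. A minimal sketch using only the standard library, with the list values copied from the output above (the names `avg_rating`, `top_bin`, and `votes` are illustrative):

```python
from collections import Counter
from statistics import mean

# Values copied from the printed rating_list above
rating_list = [7.75, 7.75, 7.75, 7.75, 7.5, 7.75, 7.75, 7.75, 7.75, 7.75]

# Mean of the lower bin edges that were predicted
avg_rating = mean(rating_list)

# Most frequently predicted bin, as (lower_edge, count)
top_bin, votes = Counter(rating_list).most_common(1)[0]

print(avg_rating)
print(top_bin, votes)
```

Both summaries assume that at least one bin was predicted; with an empty list, `mean` raises an error and `most_common(1)` returns nothing to unpack.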
