# Resources

https://medium.com/analytics-vidhya/data-cleaning-and-preprocessing-a4b751f4066f

https://towardsdatascience.com/the-complete-beginners-guide-to-data-cleaning-and-preprocessing-2070b7d4c6d

https://medium.com/sciforce/data-cleaning-and-preprocessing-for-beginners-25748ee00743

In [1]:
# Data preprocessing: 
# Transformation of the raw dataset into an understandable format. 
# Improve data efficiency. 
# The data preprocessing methods directly affect the outcomes of any analytic algorithm.

In [2]:
# Steps in Data Preprocessing:
# 1. Gathering the data
# 2. Import the dataset & Libraries
# 3. Dealing with Missing Values
# 4. Divide the dataset into Dependent & Independent variable
# 5. dealing with Categorical values
# 6. Split the dataset into training and test set
# 7. Feature Scaling

In [3]:
# Importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
# Reading/importing the dataset
df = pd.read_csv("train.csv")
df

Unnamed: 0,Id,ProductId,UserId,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,0,5019281,ADZPIG9QOCDG5,0,0,4.0,1203984000,good version of a classic,This is a charming version of the classic Dick...
1,1,5019281,A35947ZP82G7JH,0,0,3.0,1388361600,Good but not as moving,It was good but not as emotionally moving as t...
2,2,5019281,A3UORV8A9D5L2E,0,0,3.0,1388361600,Winkler's Performance was ok at best!,"Don't get me wrong, Winkler is a wonderful cha..."
3,3,5019281,A1VKW06X1O2X7V,0,0,5.0,1202860800,It's an enjoyable twist on the classic story,Henry Winkler is very good in this twist on th...
4,4,5019281,A3R27T4HADWFFJ,0,0,4.0,1387670400,Best Scrooge yet,This is one of the best Scrooge movies out. H...
...,...,...,...,...,...,...,...,...,...
994,994,310263662,A2VCW1OQD4GHC,0,0,5.0,1350777600,Passion of the Christ,This video is very touching and brings out the...
995,995,310263662,A18D1RLW38LVLW,2,4,5.0,1081382400,"Great movie, VERY strong though...",I went to see this movie with my girlfriend in...
996,996,310263662,A2EMP366TTS6E1,1,2,5.0,1081296000,--Heart wrenching portrayal of The Passion of ...,After hearing so many controversial things abo...
997,997,310263662,A1VYD8OKS7VICD,7,13,4.0,1083283200,The Ultimate Torturing of Jesus Christ,I had read many reviews before I saw this film...


In [5]:
# Information about total number of entries and total number of non-null values and their datatypes.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Id                      999 non-null    int64  
 1   ProductId               999 non-null    int64  
 2   UserId                  999 non-null    object 
 3   HelpfulnessNumerator    999 non-null    int64  
 4   HelpfulnessDenominator  999 non-null    int64  
 5   Score                   817 non-null    float64
 6   Time                    999 non-null    int64  
 7   Summary                 999 non-null    object 
 8   Text                    999 non-null    object 
dtypes: float64(1), int64(5), object(3)
memory usage: 70.4+ KB


In [6]:
# Finding the null values for each column and returning the total number of null values for each column.
# df.isna(): Shows the null values in the dataset, 5 top rows and 5 bottom rows only.
df.isnull().sum() # You can also use insa() instead of isnull()

Id                          0
ProductId                   0
UserId                      0
HelpfulnessNumerator        0
HelpfulnessDenominator      0
Score                     182
Time                        0
Summary                     0
Text                        0
dtype: int64

In [7]:
# Replacing the null values with the mean.
df.fillna(df.mean(), inplace=True)

In [8]:
# Checking the number of null values in each column again which should be 0 this time.
print(df.isnull().sum())

Id                        0
ProductId                 0
UserId                    0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
dtype: int64


In [9]:
# Diviging the dataset into dependent and independent variables
# iloc of pandas: Fixes indexes for selection df.iloc[rowSelection, columnSelection]
X = df.iloc[:, [0, 1, 2, 3, 4, 6, 7, 8]]
Y = df.iloc[:, [5]]

In [10]:
X

Unnamed: 0,Id,ProductId,UserId,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text
0,0,5019281,ADZPIG9QOCDG5,0,0,1203984000,good version of a classic,This is a charming version of the classic Dick...
1,1,5019281,A35947ZP82G7JH,0,0,1388361600,Good but not as moving,It was good but not as emotionally moving as t...
2,2,5019281,A3UORV8A9D5L2E,0,0,1388361600,Winkler's Performance was ok at best!,"Don't get me wrong, Winkler is a wonderful cha..."
3,3,5019281,A1VKW06X1O2X7V,0,0,1202860800,It's an enjoyable twist on the classic story,Henry Winkler is very good in this twist on th...
4,4,5019281,A3R27T4HADWFFJ,0,0,1387670400,Best Scrooge yet,This is one of the best Scrooge movies out. H...
...,...,...,...,...,...,...,...,...
994,994,310263662,A2VCW1OQD4GHC,0,0,1350777600,Passion of the Christ,This video is very touching and brings out the...
995,995,310263662,A18D1RLW38LVLW,2,4,1081382400,"Great movie, VERY strong though...",I went to see this movie with my girlfriend in...
996,996,310263662,A2EMP366TTS6E1,1,2,1081296000,--Heart wrenching portrayal of The Passion of ...,After hearing so many controversial things abo...
997,997,310263662,A1VYD8OKS7VICD,7,13,1083283200,The Ultimate Torturing of Jesus Christ,I had read many reviews before I saw this film...


In [11]:
Y

Unnamed: 0,Score
0,4.000000
1,3.000000
2,3.000000
3,5.000000
4,4.000000
...,...
994,5.000000
995,5.000000
996,5.000000
997,4.000000


In [12]:
# Since the models are based on calculations and mathematical equations, it is harder for the computers to understand texts.
# Hence, we need to encode the categorical data.
# Create an object of the LableEncoder class.
labelEncoder = LabelEncoder()

In [13]:
X.iloc[:, 2]= labelEncoder.fit_transform(X.iloc[:, 2])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [14]:
X.iloc[:, 6]= labelEncoder.fit_transform(X.iloc[:, 6])

In [15]:
X.iloc[:, 7]= labelEncoder.fit_transform(X.iloc[:, 7])

In [16]:
X

Unnamed: 0,Id,ProductId,UserId,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text
0,0,5019281,804,0,0,1203984000,882,754
1,1,5019281,535,0,0,1388361600,347,499
2,2,5019281,718,0,0,1388361600,845,146
3,3,5019281,232,0,0,1202860800,455,218
4,4,5019281,681,0,0,1387670400,206,811
...,...,...,...,...,...,...,...,...
994,994,310263662,471,0,0,1350777600,608,871
995,995,310263662,60,2,4,1081382400,378,428
996,996,310263662,364,1,2,1081296000,17,62
997,997,310263662,236,7,13,1083283200,740,299


In [17]:
# Create Dummy Variables do that the system does not think 1 is better than 0 for example. This way every category is assigned a column.
# onehotencoder = OneHotEncoder(categorical_features=[0])
ct = ColumnTransformer([("UserId", OneHotEncoder(),[2])]) # The last arg ([0]) is the list of columns you want to transform in this step
ct.fit_transform(X).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [18]:
ct = ColumnTransformer([("Summary", OneHotEncoder(),[6])]) # The last arg ([0]) is the list of columns you want to transform in this step
ct.fit_transform(X).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [19]:
ct = ColumnTransformer([("Text", OneHotEncoder(),[7])]) # The last arg ([0]) is the list of columns you want to transform in this step
ct.fit_transform(X)
ct.fit_transform(X).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [20]:
X

Unnamed: 0,Id,ProductId,UserId,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text
0,0,5019281,804,0,0,1203984000,882,754
1,1,5019281,535,0,0,1388361600,347,499
2,2,5019281,718,0,0,1388361600,845,146
3,3,5019281,232,0,0,1202860800,455,218
4,4,5019281,681,0,0,1387670400,206,811
...,...,...,...,...,...,...,...,...
994,994,310263662,471,0,0,1350777600,608,871
995,995,310263662,60,2,4,1081382400,378,428
996,996,310263662,364,1,2,1081296000,17,62
997,997,310263662,236,7,13,1083283200,740,299


In [21]:
# Splitting the data into train and test dataset. This is has already been done and we are working with the train data.
# But for the purpose of the demo we will split the train dataset.
# Normally 70:30 or 80:20
# random_state: Data is shuffled before beging splitted. If you do not want random shuffling, you can pass an integer to it.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

In [22]:
# Feature Scaling: Used to standardized the range of the independent features.
# It highly effects ML models. Features may vary in magnitude and range and ML models normally use Euclidean distance between 
# two data points so it can cause problems if we don't scale the features.
# Different ways: Standardization, Normalization, Min-Max Scalar Technique.
# Apply standardization formula and create an instance of that class.
standardScaler_x = StandardScaler()

In [23]:
# Transform data into the standard scale.
X_train = standardScaler_x.fit_transform(X_train)
X_test = standardScaler_x.fit_transform(X_test)

In [24]:
print(X_train)

[[-0.33792631  0.50987315 -0.80778811 ... -1.65525021 -0.60941199
  -1.64958644]
 [ 1.45385758  0.53189931 -0.71360105 ... -1.02935078 -1.57434285
  -0.34913661]
 [-0.81273171  0.50689567 -1.24481607 ...  1.1229901   1.27866138
  -0.05593173]
 ...
 [ 0.47305518  0.53189931 -0.78141574 ... -0.91248159 -0.65499928
   1.48943041]
 [ 0.23045388  0.53189931 -0.99239475 ...  0.58019765  1.62816389
  -1.66683379]
 [ 0.66367049  0.53189931 -1.05644195 ... -0.88845848  1.19128575
   1.41699156]]


In [25]:
print(X_test)

[[ 0.72311287  0.4633371  -0.21525718 ... -0.6134449   1.04879812
   1.56243222]
 [-1.52067064 -2.20978348 -1.40403761 ...  1.11879687 -1.44011704
  -0.02893393]
 [ 1.26231665  0.4633371   1.16864222 ... -0.83448446  0.32977819
   0.44741736]
 ...
 [-1.42326609 -2.20978348 -1.63528973 ...  0.66359968  1.58714126
  -0.87930919]
 [-1.55197925 -2.20978348  1.12528245 ...  1.33983643 -1.63554297
  -0.85108097]
 [ 0.07258959  0.4633371  -0.12492432 ...  0.10739331  0.49201858
   0.65912904]]


In [26]:
# You do not do scaling for the dependent varibale in a classification however you do it for regression.