# Data Preprocessing

# Importance of data preprocessing


* Data preprocessing matters because good data is more important than a good model, which makes data quality paramount.
* Companies and individuals therefore invest much of their time in cleaning and preparing data for modeling.
* Real-world data has many quality issues: it is often noisy, inaccurate, and incomplete.
* It may lack relevant, specific attributes and may contain missing, incorrect, or spurious values.
* Preprocessing is essential for improving the quality of the data.
* Preprocessing makes the data consistent by eliminating duplicates and irregularities and by normalizing values so they can be compared, which improves the accuracy of the results.
* Machines work with numbers, ultimately binary 1s and 0s.
* Most of the data generated and available today is unstructured: it is not in tabular form and has no fixed structure.
* The most commonly consumed form of unstructured data is text, such as tweets, posts, and comments.
* Data also arrives as images and audio, which cannot be fed directly into a model.
* Such data must therefore be converted or transformed into a form the machine can interpret.
* To reiterate, data preprocessing is a crucial step in the data science process.

# Data Preprocessing Steps:

* 1.Data Cleaning
* 2.Data Integration
* 3.Data Transformation
* 4.Data Reduction
* 5.Data Splitting

# 1.Data Cleaning

# What is Data Cleaning?
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
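
As a rough illustration of this definition, the sketch below applies a few common cleaning operations to a small made-up DataFrame. The column names and values are hypothetical and are not taken from the Titanic data used later; which operations are appropriate depends on the dataset.

```python
import pandas as pd

# Hypothetical raw records with typical problems: inconsistent formatting,
# a duplicate row, an impossible value, and a missing value.
raw = pd.DataFrame({
    "name": [" Alice", "Bob", "Bob", "Carol", "Dan"],
    "age": [29, 41, 41, -5, None],
    "city": ["paris", "London", "London", "PARIS", "Berlin"],
})

cleaned = raw.copy()

# Fix incorrectly formatted text: trim whitespace and unify capitalization.
cleaned["name"] = cleaned["name"].str.strip()
cleaned["city"] = cleaned["city"].str.title()

# Remove exact duplicate rows.
cleaned = cleaned.drop_duplicates()

# Treat impossible ages as missing rather than keeping an incorrect value.
cleaned["age"] = cleaned["age"].where(cleaned["age"] >= 0)

# Remove rows that are still incomplete.
cleaned = cleaned.dropna().reset_index(drop=True)

print(cleaned)
```

Dropping incomplete rows is only one option and is used here for brevity; filling the gaps is often preferable when few rows are available.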

# Assess Data quality
There are various measures of data quality, such as accuracy, consistency, completeness, timeliness, and validity. Accuracy is often treated as the most important because it indicates how close the recorded values are to reality.
Common measures used in a data quality assessment are:

    1- Completeness
    This measures how much of the expected data is present. For example, if you are tracking the number of visitors to a website, completeness would be the percentage of visits that were actually recorded.

    2- Accuracy
    Accuracy is one of the most important measures in data quality assessment, as it indicates how close the data is to the true value. For example, if you are recording the temperature outside, accuracy would be how close your reading is to the actual temperature.

    3- Timeliness
    This measures how up to date the data is. For example, if you are tracking the stock price of a company, timeliness would be how recently your recorded price was updated relative to the current market price.

    4- Consistency
    This measures whether the data agrees with itself across records and sources. For example, if you are tracking the number of employees at a company, consistency would be whether two systems that both store the headcount report the same figure.
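
The measures above can be turned into simple numbers. The following is a minimal sketch on a made-up table of sensor readings; the column names, the `reference_temp` ground truth, and the fixed "now" timestamp are all assumptions for illustration, and real projects rarely have a trusted reference value for every record.

```python
import pandas as pd

# Hypothetical sensor readings; "reference_temp" stands in for a trusted
# ground-truth measurement.
readings = pd.DataFrame({
    "sensor_id": ["a", "a", "b", "b"],
    "recorded_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-03"]
    ),
    "temp_c": [20.5, None, 19.0, 22.0],
    "reference_temp": [20.0, 21.0, 19.0, 21.0],
    "unit": ["C", "C", "C", "c"],
})

# Completeness: share of temperature values that are present.
completeness = readings["temp_c"].notna().mean()

# Accuracy: mean absolute deviation from the reference (missing rows skipped).
accuracy_error = (readings["temp_c"] - readings["reference_temp"]).abs().mean()

# Timeliness: age in days of the newest record relative to an assumed "now".
now = pd.Timestamp("2024-01-04")
staleness_days = (now - readings["recorded_at"].max()).days

# Consistency: number of distinct spellings of what should be one unit code.
unit_variants = readings["unit"].nunique()

print(completeness, accuracy_error, staleness_days, unit_variants)
```

A `unit_variants` value above 1 signals that the same unit is written in more than one way, which is a typical consistency problem to fix during cleaning.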

# Data anomalies
Anomaly detection (aka outlier analysis) is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior.
An unexpected change within these data patterns, or an event that does not conform to the expected data pattern, is considered an anomaly.
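
One simple way to flag values that deviate from the rest is the interquartile range (IQR) rule, sketched below. This is only one of many possible techniques and is not prescribed by the text above; the data is made up, and the 1.5 multiplier is a common convention rather than a fixed rule.

```python
import pandas as pd

# Hypothetical daily order counts with one obviously unusual day.
orders = pd.Series([52, 48, 50, 47, 51, 49, 250, 53])

# Compute the interquartile range (IQR).
q1 = orders.quantile(0.25)
q3 = orders.quantile(0.75)
iqr = q3 - q1

# Flag points outside the conventional 1.5 * IQR fences.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_anomaly = (orders < lower) | (orders > upper)

print(orders[is_anomaly])
```

Whether a flagged point is an error or a genuine event still has to be decided by looking at the data and its context.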

# Detect missing values with pandas dataframe functions: .info() and .isna():
The pandas DataFrame.isna() function detects missing values. It returns a boolean object of the same size indicating whether each value is NA. NA values, such as None or numpy.nan, are mapped to True; everything else is mapped to False. DataFrame.isnull() is an alias of isna(), and DataFrame.notnull() (an alias of notna()) returns the inverse mask.
Syntax: DataFrame.isna()

The pandas DataFrame.info() function prints a concise summary of the DataFrame, including the non-null count and dtype of each column. It comes in handy during exploratory analysis, when a quick overview of the dataset is needed.
Syntax: DataFrame.info()
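
To make the True/False mapping concrete, here is a small self-contained sketch on a made-up DataFrame (the column names and values are hypothetical). It mixes real missing markers with values that only look missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing real missing markers with look-alikes.
demo = pd.DataFrame({
    "value": [1.0, np.nan, 0.0, 3.0],
    "label": ["x", None, "", "NA"],
})

# Boolean mask with the same shape as the frame.
mask = demo.isna()
print(mask)

# None and np.nan count as missing; 0.0, "" and the literal string "NA" do not.
print(mask.sum())
```

Note that this applies to values already held in memory. When reading a file, pd.read_csv by default converts certain strings, such as empty fields and "NA", to NaN.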

In [3]:
import pandas as pd
df = pd.read_csv('dataset/titanic.csv')
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,,3,"Heikkinen, Miss. Laina",female,,0,0,STON/O2. 3101282,,,S
3,4,,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     887 non-null    float64
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          713 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         888 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(3), int64(4), object(5)
memory usage: 83.7+ KB


In [9]:
df.isnull()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,True,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,True,False,False,False,True,True,False
3,False,True,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [10]:
df.isnull().sum()

PassengerId      0
Survived         4
Pclass           0
Name             0
Sex              0
Age            178
SibSp            0
Parch            0
Ticket           0
Fare             3
Cabin          687
Embarked         2
dtype: int64

In [11]:
df.isna()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,True,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,True,False,False,False,True,True,False
3,False,True,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [12]:
df.isna().sum()

PassengerId      0
Survived         4
Pclass           0
Name             0
Sex              0
Age            178
SibSp            0
Parch            0
Ticket           0
Fare             3
Cabin          687
Embarked         2
dtype: int64

In [8]:
df.notnull()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,True,True,True,True,True,True,True,True,True,True,False,True
1,True,False,True,True,True,True,True,True,True,True,True,True
2,True,False,True,True,True,False,True,True,True,False,False,True
3,True,False,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
886,True,True,True,True,True,True,True,True,True,True,False,True
887,True,True,True,True,True,True,True,True,True,True,True,True
888,True,True,True,True,True,False,True,True,True,True,False,True
889,True,True,True,True,True,True,True,True,True,True,True,True


In [13]:
df.notnull().sum()

PassengerId    891
Survived       887
Pclass         891
Name           891
Sex            891
Age            713
SibSp          891
Parch          891
Ticket         891
Fare           888
Cabin          204
Embarked       889
dtype: int64
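
Raw counts are easier to compare across columns when expressed as percentages. The sketch below shows one way to do that; it uses a small hypothetical frame instead of the Titanic file so that it runs on its own, and the values are made up.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for a dataset with gaps.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column.
missing_pct = sample.isna().mean() * 100

# Number of rows with at least one missing value.
rows_with_gaps = int(sample.isna().any(axis=1).sum())

print(missing_pct.sort_values(ascending=False))
print(rows_with_gaps)
```

Applied to the Titanic frame, the same expression would be df.isna().mean() * 100.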

# Diagnose the type of missing values with visual and statistical methods (e.g., chi-squared test of independence):

