# Data Preprocessing

The most important part of creating a model is having sound business knowledge of the problem you are trying to solve.

## Types of research
1. Primary research: Done by yourself
    * **Discussions:** Ask questions and gather information from the stakeholders.
    * **Dry Run:** If possible, take a dry run of the problem you are trying to investigate.

2. Secondary research: Done by others
    * **Reports and Studies:** Read reports and studies by government agencies, trade associations or other businesses in your industry.
    * **Previous works:** Go through any previous work and findings related to your problem.

### Example:
Cart abandonment analysis

Problem: A high fraction of your online customers are adding products to their cart but not purchasing them.

**Business knowledge that will be helpful**
1. Discussions with the marketing team (primary research)
2. Discussions with the product team (primary research)
3. Dry run of the online purchasing process to understand the customer journey (primary research)
4. Research on industry reports regarding cart abandonment (secondary research)
5. Any previous work in your or another organization regarding cart abandonment (secondary research)

## Data Exploration
The next step is to use the acquired business knowledge to search for relevant data.

**Steps:**
1. Identify data need.
2. Plan data request.
3. Quality check.

**Types of data to request:**
1. **Internal Data:** Data collected by your organization, e.g. usage data, sales data, promotion data.
2. **External Data:** Data acquired from external sources, e.g. census data, external vendor data, scraped data.
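
The quality check in step 3 can be sketched roughly as follows. This is only an illustration: the `sales` table, its column names and the `quality_report` helper are all hypothetical, and a real check would depend on what the data dictionary promises about each table.

```python
import pandas as pd

# Hypothetical sales extract; names and values are made up for illustration.
sales = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "customer_id": [1, 2, 2, None],
    "amount": [25.0, 40.0, 40.0, 15.5],
})

def quality_report(df, key):
    """Return a few basic checks on a freshly received table."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Rows whose key value already appeared earlier in the table.
        "duplicate_keys": int(df[key].duplicated().sum()),
        # Number of missing values in each column.
        "nulls_per_column": {col: int(n) for col, n in df.isna().sum().items()},
    }

report = quality_report(sales, key="order_id")
print(report)
```

A duplicated key or an unexpected number of nulls is usually worth raising with the data owner before any modelling starts.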

### Example (continued):
1. Input from the marketing team: 50% of our customers come from e-mail marketing, 30% from organic search and the remaining 20% from AdWords marketing. -> Gather the source website data for all customers.
2. Input from the product team: We have a 3-step purchase process: cart review, address/personal details, payment. -> Gather the cart abandonment location for all customers.
3. Input from industry reports regarding cart abandonment: Customers tend to keep high-value items in their carts for a long duration. -> Gather data about the total cart value of all customers.
4. Input from dry run: Encountered a survey link to rate the website experience. -> Gather survey data for all customers.
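
To make the mapping from inputs to data concrete, here is a minimal sketch of how the four requests could end up in one analysis table. Every table name, column name and value below is hypothetical; the point is only that each business input becomes a column keyed by customer.

```python
import pandas as pd

# Hypothetical session-level data, one row per customer.
sessions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    # Marketing input: where the customer came from.
    "traffic_source": ["email", "organic_search", "adwords"],
    # Product input: step at which the cart was abandoned (None = purchased).
    "abandon_step": ["payment", "cart_review", None],
    # Industry-report input: total value of the cart.
    "cart_value": [250.0, 40.0, 15.0],
})

# Dry-run input: hypothetical survey responses, not every customer answers.
surveys = pd.DataFrame({
    "customer_id": [1, 3],
    "experience_rating": [2, 5],
})

# A left join keeps every customer, even those without a survey response.
analysis = sessions.merge(surveys, on="customer_id", how="left")
analysis["abandoned"] = analysis["abandon_step"].notna()
print(analysis)
```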

## Data Dictionary
A data dictionary helps you understand the data. You should know each variable's definition and distribution, along with each table's unique identifiers and foreign keys.

**A comprehensive Data Dictionary should include:**
1. Definition of predictors.
2. Unique identifier of each table (or primary keys).
3. Foreign keys or matching keys between tables.
4. Explanation of values in case of categorical variables.
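
Part of a data dictionary can be drafted automatically. The sketch below is one possible starting point, not a standard tool: `data_dictionary` is a hypothetical helper, and the sample uses three columns and a few values from `House_Price.csv`. The definitions themselves still have to be written by someone who knows the data, so that column is left blank.

```python
import pandas as pd

# Small sample using column names from House_Price.csv.
sample = pd.DataFrame({
    "price": [24.0, 21.6, 34.7],
    "airport": ["YES", "NO", "NO"],
    "waterbody": ["River", "Lake", None],
})

def data_dictionary(df, max_levels=10):
    """Build a skeleton data dictionary with one row per variable."""
    rows = []
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            levels = ""
        else:
            # List the observed categories so their meaning can be documented.
            observed = sorted(s.dropna().astype(str).unique())
            levels = ", ".join(observed[:max_levels])
        rows.append({
            "variable": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            "n_unique": int(s.nunique()),
            "levels": levels,
            "definition": "",  # to be filled in by hand
        })
    return pd.DataFrame(rows)

dd = data_dictionary(sample)
print(dd)
```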

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

dataset = pd.read_csv('House_Price.csv', header = 0)

In [3]:
dataset.head()

| | price | crime_rate | resid_area | air_qual | room_num | age | dist1 | dist2 | dist3 | dist4 | teachers | poor_prop | airport | n_hos_beds | n_hot_rooms | waterbody | rainfall | bus_ter | parks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 32.31 | 0.538 | 6.575 | 65.2 | 4.35 | 3.81 | 4.18 | 4.01 | 24.7 | 4.98 | YES | 5.48 | 11.192 | River | 23 | YES | 0.049347 |
| 1 | 21.6 | 0.02731 | 37.07 | 0.469 | 6.421 | 78.9 | 4.99 | 4.7 | 5.12 | 5.06 | 22.2 | 9.14 | NO | 7.332 | 12.1728 | Lake | 42 | YES | 0.046146 |
| 2 | 34.7 | 0.02729 | 37.07 | 0.469 | 7.185 | 61.1 | 5.03 | 4.86 | 5.01 | 4.97 | 22.2 | 4.03 | NO | 7.394 | 101.12 | | 38 | YES | 0.045764 |
| 3 | 33.4 | 0.03237 | 32.18 | 0.458 | 6.998 | 45.8 | 6.21 | 5.93 | 6.16 | 5.96 | 21.3 | 2.94 | YES | 9.268 | 11.2672 | Lake | 45 | YES | 0.047151 |
| 4 | 36.2 | 0.06905 | 32.18 | 0.458 | 7.147 | 54.2 | 6.16 | 5.86 | 6.37 | 5.86 | 21.3 | 5.33 | NO | 8.824 | 11.2896 | Lake | 55 | YES | 0.039474 |


In [6]:
dataset.shape

(506, 19)

## Univariate Analysis

Univariate analysis is the simplest form of data analysis. "Uni" means "one", so the analysis looks at only one variable at a time. It doesn't deal with causes or relationships (unlike regression); its major purpose is to describe: it takes data, summarizes it and finds patterns in it.

**Ways to describe patterns found in univariate data:**
1. Central tendency
    * Mean
    * Mode
    * Median
2. Dispersion
    * Range
    * Variance
    * Maximum, minimum
    * Quartiles (including the interquartile range)
    * Standard deviation
3. Count/Null count

In [7]:
dataset.describe()

| | price | crime_rate | resid_area | air_qual | room_num | age | dist1 | dist2 | dist3 | dist4 | teachers | poor_prop | n_hos_beds | n_hot_rooms | rainfall | parks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 498.0 | 506.0 | 506.0 | 506.0 |
| mean | 22.528854 | 3.613524 | 41.136779 | 0.554695 | 6.284634 | 68.574901 | 3.971996 | 3.628775 | 3.960672 | 3.618972 | 21.544466 | 12.653063 | 7.899767 | 13.041605 | 39.181818 | 0.054454 |
| std | 9.182176 | 8.601545 | 6.860353 | 0.115878 | 0.702617 | 28.148861 | 2.108532 | 2.10858 | 2.119797 | 2.099203 | 2.164946 | 7.141062 | 1.476683 | 5.238957 | 12.513697 | 0.010632 |
| min | 5.0 | 0.00632 | 30.46 | 0.385 | 3.561 | 2.9 | 1.13 | 0.92 | 1.15 | 0.73 | 18.0 | 1.73 | 5.268 | 10.0576 | 3.0 | 0.033292 |
| 25% | 17.025 | 0.082045 | 35.19 | 0.449 | 5.8855 | 45.025 | 2.27 | 1.94 | 2.2325 | 1.94 | 19.8 | 6.95 | 6.6345 | 11.1898 | 28.0 | 0.046464 |
| 50% | 21.2 | 0.25651 | 39.69 | 0.538 | 6.2085 | 77.5 | 3.385 | 3.01 | 3.375 | 3.07 | 20.95 | 11.36 | 7.999 | 12.72 | 39.0 | 0.053507 |
| 75% | 25.0 | 3.677083 | 48.1 | 0.624 | 6.6235 | 94.075 | 5.3675 | 4.9925 | 5.4075 | 4.985 | 22.6 | 16.955 | 9.088 | 14.1708 | 50.0 | 0.061397 |
| max | 50.0 | 88.9762 | 57.74 | 0.871 | 8.78 | 100.0 | 12.32 | 11.93 | 12.32 | 11.94 | 27.4 | 37.97 | 10.876 | 101.12 | 60.0 | 0.086711 |
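
`describe()` reports the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column, but not the mode, variance, range or interquartile range listed earlier. As a minimal sketch, these can be computed per variable with standard pandas methods. The series below holds only the first five `price` values from `dataset.head()`, so the numbers it produces are illustrative and do not describe the full dataset.

```python
import pandas as pd

# First five `price` values shown by dataset.head().
price = pd.Series([24.0, 21.6, 34.7, 33.4, 36.2], name="price")

summary = {
    # Central tendency
    "mean": price.mean(),
    "median": price.median(),
    "mode": price.mode().tolist(),  # a list, since ties are possible
    # Dispersion
    "range": price.max() - price.min(),
    "variance": price.var(),        # sample variance (ddof=1)
    "std": price.std(),             # sample standard deviation (ddof=1)
    "iqr": price.quantile(0.75) - price.quantile(0.25),
    # Count / null count
    "count": int(price.count()),
    "null_count": int(price.isna().sum()),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

On the full data the same calls would be applied to `dataset["price"]` or any other column.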
