# Data Preparation
Welcome to the Data Preparation lesson. In this lesson we will cover:
- **Filling missing data**
- **Skewness**
- **Outliers** 
- **Datetime**

The lab for Lesson 6 consists of all the exercises that you will find throughout the notebook.

For this lesson we will again be using the Titanic Survival Dataset from Kaggle. 

Let's review the column values once more as a reminder of the data we are using:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `?`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `?`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
- **Home.Dest**: Home / Destination

In [None]:
# import libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

# Import titanic dataset
titanic_data = pd.read_csv("titanic_data.csv")
titanic_data

In [None]:
titanic_data.info()

In [None]:
# make sure the sex column is stored as strings
titanic_data['sex'] = titanic_data['sex'].map(str)

# Filling Missing Data
Let's replace the missing values in the age column with the mean value.


In [None]:
# modify pandas display options to view the full list
pd.options.display.max_rows = 4000

# check for missing values (kept as the last line so the notebook displays the result)
titanic_data.age.value_counts()

In [None]:
# is '?' considered a null value?
titanic_data.isnull().sum()

In [None]:
# replace '?' with None so pandas treats it as missing
titanic_data = titanic_data.replace({'?': None})

# change the type of the age column to numeric
titanic_data['age'] = pd.to_numeric(titanic_data['age'], errors='coerce')

# let's round the age values
titanic_data['age'] = titanic_data['age'].round()

# let's check for null values again
titanic_data.isnull().sum()
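
As an alternative to replacing `?` after loading, pandas can treat it as missing while the file is being read. The following is a minimal sketch on a small in-memory CSV; the `sample_csv` text and its column values are made up for illustration and are not the Titanic file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for a CSV file that marks missing values with '?'
sample_csv = io.StringIO(
    "name,age,cabin\n"
    "A,29,B5\n"
    "B,?,?\n"
    "C,2.5,C22\n"
)

# na_values tells read_csv which extra strings should be parsed as NaN
sample_df = pd.read_csv(sample_csv, na_values="?")

# '?' entries are now counted as null, and age is parsed as a numeric column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
print(sample_df.dtypes)
```

With this approach the separate `replace` and `to_numeric` steps may not be needed, although checking `isnull().sum()` afterwards is still worthwhile.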

## Using mean to fill in missing data

Let's replace missing age values with the average age of the passengers of the same sex.

In [None]:
titanic_data['sex'].value_counts()

In [None]:
# create a dictionary with the average age for each sex
gender_ages = dict(round(titanic_data.groupby('sex')['age'].mean()))
gender_ages

In [None]:
gender_ages['male']

In [None]:
# create a column of the average ages
titanic_data['age_mean'] = titanic_data['sex'].apply(lambda x: gender_ages[x])

# replace all missing ages with the value in this column
titanic_data['age'] = titanic_data['age'].fillna(titanic_data['age_mean'])

# check the age column again
titanic_data['age'].value_counts()

## Exercise 1
Think about other ways to replace the missing values in the age column.
- What other information could be used to fill in the missing values?


## Solution
- Other ways to group passengers would be by name prefix (title) or by socio-economic class.
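
One way the prefix idea could look in code is sketched below. It assumes names follow a "Surname, Title. Given names" pattern; the `sample` rows and the `title` column are hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the lowercase names used in this notebook
sample = pd.DataFrame({
    "name": [
        "Allen, Miss. Elisabeth", "Allison, Master. Hudson", "Allison, Mr. Hudson",
        "Brown, Miss. Amelia", "Carter, Master. William", "Dean, Mr. Bertram",
    ],
    "age": [29.0, 1.0, 30.0, np.nan, np.nan, 40.0],
})

# Pull out the title that sits between the comma and the first period
sample["title"] = sample["name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mean age per title, aligned row by row with the original frame
title_mean_age = sample.groupby("title")["age"].transform("mean")

# Fill each missing age with the mean age of passengers sharing the same title
sample["age"] = sample["age"].fillna(title_mean_age)
print(sample[["title", "age"]])
```

The same `groupby(...).transform("mean")` pattern would work with a socio-economic class column in place of `title`.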

# Skewed Data

Let's create a histogram to inspect the distribution of the age feature.

In [None]:
%matplotlib inline
# Create a histogram
plt.hist(titanic_data['age'], density=True, bins=25)
plt.title("Age: Feature Distribution")
plt.xlabel("Age")
plt.ylabel("Density")


In [None]:
titanic_data.describe()

In [None]:
# Create a copy of our dataset
titanic_data_transformed = titanic_data.copy()

# apply the log transformation, log(x + 1), to the age column
titanic_data_transformed['age'] = np.log1p(titanic_data_transformed['age'])

# Visualize the new log distribution
plt.hist(titanic_data_transformed['age'], density=True, bins=30)
plt.title("Age: Log-Transformed Feature Distribution")
plt.xlabel("log(Age + 1)")
plt.ylabel("Density")


In [None]:
titanic_data_transformed.describe()
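
Besides comparing histograms, the effect of the log transformation can be checked numerically with the skewness statistic. The sketch below uses a made-up right-skewed series (`values` is hypothetical, not the Titanic data) and compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: a few large values stretch the right tail
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 40, 120], dtype=float)

# Series.skew() returns the sample skewness; values near 0 suggest a symmetric distribution
skew_before = values.skew()

# np.log1p computes log(1 + x), the same transform applied to the age column above
skew_after = np.log1p(values).skew()

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

A skewness closer to zero after the transform indicates that the long right tail has been compressed.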

# Outliers

In [None]:
# calculate percentiles
age_percentiles = np.percentile(titanic_data['age'], [25, 50, 75])

# Print the result
print(age_percentiles)

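- The 25th, 50th, and 75th percentiles split the ages into four buckets; the two middle buckets, between the 25th and 75th percentiles, are where the most common ages tend to fall.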
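
The percentiles can also be turned into a numeric outlier rule. The sketch below applies the common 1.5 × IQR rule, which is also the default whisker length of a seaborn box plot; the `ages` array is hypothetical and the rule is one convention among several, not the only way to define an outlier.

```python
import numpy as np

# Hypothetical ages; 95 is deliberately far from the rest
ages = np.array([2, 18, 22, 24, 28, 30, 31, 35, 40, 47, 95], dtype=float)

# Quartiles and the interquartile range (IQR)
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print("quartiles:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Computing fences this way gives a threshold derived from the data rather than one chosen by eye.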

In [None]:
# import seaborn
import seaborn as sns


# Create box plot with Seaborn's default settings
_ = sns.boxplot(x='age',data=titanic_data)


# Label the axes
_ = plt.xlabel('age')


# Show the plot
plt.show()

In [None]:
titanic_data.describe()

In [None]:
# Remove outliers: keep only the passengers younger than 65
df_no_outliers = titanic_data[titanic_data['age'] < 65]
df_no_outliers['age'].value_counts()

In [None]:
# Create box plot with Seaborn's default settings
_ = sns.boxplot(x='age',data=df_no_outliers)

# Label the axes
_ = plt.xlabel('age')


# Show the plot
plt.show()

In [None]:
# calculate new statistics
df_no_outliers.describe()

## Exercise 2
Remove the outliers from the 'fare' feature.

## Solution

In [None]:
# How many null values do we have?
titanic_data.isnull().sum()

In [None]:
# Visual inspection of the values in the fare column
titanic_data.fare

In [None]:
# Making sure the fare column is of numeric type
titanic_data['fare'] = pd.to_numeric(titanic_data['fare'], errors = 'coerce')

In [None]:
# let's look at the statistics of the fare column
titanic_data.describe()

In [None]:
%matplotlib inline
# Create a histogram
plt.hist(titanic_data['fare'], density=True, bins=25)
plt.title("fare: Feature Distribution")
plt.xlabel("fare")
plt.ylabel("Density")


In [None]:
# Create box plot with Seaborn's default settings
_ = sns.boxplot(x='fare',data=titanic_data)

# Label the axes
_ = plt.xlabel('fare')


# Show the plot
plt.show()

In [None]:
# Remove fare outliers: keep only fares below 300
df_no_outliers_fares = titanic_data[titanic_data['fare'] < 300]

In [None]:
%matplotlib inline
# Create a histogram
plt.hist(df_no_outliers_fares['fare'], density=True, bins=25)
plt.title("fare: Feature Distribution")
plt.xlabel("fare")
plt.ylabel("Density")


In [None]:
# Create box plot with Seaborn's default settings
_ = sns.boxplot(x='fare',data=df_no_outliers_fares)

# Label the axes
_ = plt.xlabel('fare')


# Show the plot
plt.show()

# DateTime Manipulation

In [None]:
# Import datetime
import datetime as dt

# Import data with datetime (appointment data from a doctor's office)
date_df = pd.read_csv("datetime_data.csv")
date_df.head()

In [None]:
date_df.info()

- Note that date features like "AppointmentDay" and "ScheduledDay" are of type "object" instead of datetime.
- This is because pandas is currently treating these values as strings instead of dates.
- To convert these features to datetime, we can use the pandas function `to_datetime`.
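
The conversion described above can be tried on a small frame first. This is a sketch only: the `sample_dates` values and the `wait_days` column are hypothetical, and the timestamp format in the real file may differ.

```python
import pandas as pd

# Hypothetical timestamps stored as text, as they would be right after read_csv
sample_dates = pd.DataFrame({
    "ScheduledDay": ["2016-04-29 18:38:08", "2016-04-27 08:36:51"],
    "AppointmentDay": ["2016-04-29 00:00:00", "2016-05-03 00:00:00"],
})

# Convert each text column to a datetime column
for column in ["ScheduledDay", "AppointmentDay"]:
    sample_dates[column] = pd.to_datetime(sample_dates[column])
print(sample_dates.dtypes)

# Once converted, date arithmetic works directly:
# normalize() drops the time of day so the difference is in whole days
sample_dates["wait_days"] = (
    sample_dates["AppointmentDay"].dt.normalize()
    - sample_dates["ScheduledDay"].dt.normalize()
).dt.days
print(sample_dates["wait_days"].tolist())
```

Subtracting string columns in this way would raise an error, which is one practical reason to convert them first.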

In [None]:
# Convert AppointmentDay into datetime
date_df['AppointmentDay'] = pd.to_datetime(date_df['AppointmentDay'])

# Convert ScheduledDay into datetime
date_df['ScheduledDay'] = pd.to_datetime(date_df['ScheduledDay'])

date_df.info()

Now that these values have the datetime type, you can use them for a richer analysis.
- datetime64 columns expose several properties in pandas, through the `.dt` accessor, that we can now use.

Link: https://pandas.pydata.org/pandas-docs/version/0.23/api.html#datetimelike-properties

In [None]:
# Create a new field that tells us the year in which each appointment was set
date_df['year_set'] = date_df['ScheduledDay'].dt.year

# Create a new field that tells us which day of the year the appointment was set
date_df['day_num'] = date_df['ScheduledDay'].dt.dayofyear

date_df.sort_values(by=['Age']).head(26)

## Exercise 3
Find the week of the year corresponding to the 'ScheduledDay' feature.
(Hint: use the link above to find the method necessary to solve this exercise)

## Solution

In [None]:
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the ISO week number
date_df['week_num'] = date_df['ScheduledDay'].dt.isocalendar().week

In [None]:
date_df.sort_values(by=['AppointmentID']).head(26)