# Data QA and Cleaning
Before we can do any meaningful analysis, we need to make sure that we have the data we need and take care of any "bad" data that could be problematic. Some common things to check during data Quality Assurance (QA) are:
* Consistent data types
* Missing and N/A values
* Unreasonable outliers/values
* Duplicates

In [None]:
# import necessary libraries
import pandas as pd # for data frames, reading and writing data
from matplotlib import pyplot as plt
import numpy as np
from scipy import stats

# the next line makes the matplotlib plots show up in the notebook cell
%matplotlib inline

## Load Data
Let's use the same sample data that we used before in the Pandas section. We'll load the user data, since that has the most fields and potential for "dirty" data.

In [None]:
filename = 'sample_data.xlsx'
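# Assumption: the user data is on the first sheet; pass sheet_name=... if it is not
user_df = pd.read_excel(filename)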
user_df.head()

In [None]:
# Review the data with the `describe` function
user_df.describe()

## NA Columns
Since `describe` reports only the non-NA count for each column, we can get the NA counts directly by calling `.isna()` and then summing the results:

In [None]:
user_df.isna().sum()

Let's say we wanted to create a model that uses the text from `description` as its input. We'd want to drop all of the records with NA descriptions, and we may also want to drop any descriptions shorter than n words.

Let's create a subset that has descriptions with more than 2 words.

In [None]:
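# Add a column with the description word count (we'll use a list comprehension
# and the `split` function for that). NA descriptions are not strings, so they get 0.
user_df['desc_len'] = [len(d.split()) if isinstance(d, str) else 0 for d in user_df['description']]

# Filter the data set to those descriptions that are non-NA with length > 2.
# .copy() gives an independent frame, so adding columns later won't trigger a warning.
user_sub = user_df.loc[user_df['description'].notna() & (user_df['desc_len'] > 2)].copy()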
len(user_sub)

## Data Types
Sometimes imported data doesn't arrive as the data type that you need. Most often this happens with dates, but sometimes numbers come in as strings too, causing problems. 

Use the `dtypes` attribute of a pandas dataframe to get the list of data types.

In [None]:
user_sub.dtypes

Looks like the created_at date came in as a date, so no need to change it. Let's create a string version of that variable and then convert it back...

We can use the .astype() function to convert an entire pandas.Series to a different type.
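
As a rough illustration of that round trip, here is a minimal sketch on a toy data frame (the `toy` frame and its values are made up for this example and are not part of the workshop data). It converts a datetime column to strings with `.astype(str)` and then back again with `pd.to_datetime`.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({'created_at': pd.to_datetime(['2015-03-01', '2016-07-15'])})

# datetime -> string: .astype() converts the whole Series at once
toy['created_at_str'] = toy['created_at'].astype(str)

# string -> datetime: pd.to_datetime parses the strings back into dates
toy['round_trip'] = pd.to_datetime(toy['created_at_str'])

# The dtypes show which columns hold strings and which hold datetimes
print(toy.dtypes)
```

If the strings carry a time component, a `format` argument passed to `pd.to_datetime` has to describe the time part as well.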

In [None]:
user_sub['created_at_str'] = user_sub['created_at'].astype(str)

Now assume we received this data with `created_at_str` as a string and we want to create a date-only column. First we convert the string back to a datetime:

In [None]:
converted_dates = pd.to_datetime(user_sub.created_at_str, format='%Y-%m-%d')

# Test to see if the conversion matches the original data:
converted_dates.equals(user_sub['created_at'])   

In [None]:
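# Make a date-only column and a month column
# (assumption: the month is stored as an integer from 1 to 12)
user_sub.loc[:, 'created_date'] = user_sub['created_at'].dt.date
user_sub.loc[:, 'created_month'] = user_sub['created_at'].dt.month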

## Quick EDA
Pandas wraps some of the more basic Matplotlib plotting functionality, which makes it easier to create quick plots of dataframe data. With our new `created_date` and `created_month` columns, let's create a quick histogram to see how the creation dates are distributed by month.


In [None]:
# Histogram of `created_month`
user_sub['created_month'].hist()

In [None]:
# Value counts of created_month to compare with the histogram
user_sub['created_month'].value_counts().sort_index()

## Merging and Appending Data
We covered merging and appending data in the Pandas notebook. Here are some stumbling blocks that I've run into when trying to merge data and ways around them:
1. Data types - if the key columns' data types don't match (for example, integers in one data frame and strings in the other), Pandas will refuse to merge the data. Even calling `.astype()` on a column, say to convert it from float to int, won't work if the column contains a missing (NaN) or non-numeric value.
    Solution - wrestle with both data frames until you can get the datatypes to match. I've had to keep integer fields as floats, or change dates to strings in order for a merge to work.
1. Date data - there are a few different date types (pandas `Timestamp`/`datetime64`, Python `datetime.datetime` and `datetime.date`, plain strings); they have to match for a merge to work.
1. Indices - if you are just appending data, you may need to add `ignore_index=True` for the append to work cleanly, especially if there are matching indices in the two data frames. (`pd.concat` accepts the same argument; `DataFrame.append` was removed in pandas 2.0.)
1. Extra columns - these automatically get added when you append two data frames, with the resulting data frame having all of the columns existing in either one.
1. Column order - this seems to get messed up sometimes when you append or merge data frames. It's easy to fix by simply redefining the data set, passing in the list of columns in the order that you need them.
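
To make a few of these stumbling blocks concrete, here is a small sketch on made-up data (the `left` and `right` frames and their columns are hypothetical). It shows a key-column type mismatch, one way to align the types, and an append with `pd.concat` and `ignore_index=True` followed by a column reorder.

```python
import pandas as pd

# Hypothetical frames: 'id' is an integer on the left but a string on the right
left = pd.DataFrame({'id': [1, 2, 3], 'followers': [10, 20, 30]})
right = pd.DataFrame({'id': ['1', '2', '4'], 'friends': [5, 6, 7]})

# Merging on mismatched key types typically raises a ValueError
try:
    left.merge(right, on='id')
except ValueError as err:
    print('merge failed:', err)

# Align the key types first, then merge (left join keeps every row of `left`)
right['id'] = right['id'].astype(int)
merged = left.merge(right, on='id', how='left')

# Append rows and rebuild the index so there are no repeated index labels
stacked = pd.concat([left, left], ignore_index=True)

# Put the columns back in the order we want
stacked = stacked[['followers', 'id']]
```

Note that `friends` comes back as a float column in `merged`, because the unmatched row holds NaN. This is one of the cases where an integer field ends up being kept as a float.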

## Imputing Missing Values
There are many different options when looking to fill in missing values. Some common methods:
1. Drop the data
1. Fill with zeros
1. Fill with the population mean
1. Fill with a grouped mean (mean of a subset that each missing data point belongs to)

All of these can be handled by subsetting the data frames and applying the logic that you want to fill the variable. 

We've already shown how to drop the data above when we dropped rows that had insufficient data in the `description` column for our analysis.

To create some examples, let's create some holes in the user data that we've been working with.

In [None]:
# Take a random sample from the data and set the values to na
sample_idx = user_df.sample(n=40, random_state=24).index 
user_df.loc[sample_idx, ['followers_count', 'friends_count', 'favourites_count']] = np.nan

In [None]:
# Use the function from earlier to check our new NA counts
user_df[['followers_count', 'friends_count', 'favourites_count']].isna().sum()

### 1. Fill values with 0
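
A minimal sketch of this option, using a toy data frame with made-up values so that the NaN values in `user_df` stay in place for the later examples. It assumes `fillna` is the tool we want and writes the result to a new column rather than overwriting the original.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing follower count
toy = pd.DataFrame({'followers_count': [10.0, np.nan, 30.0]})

# Replace every NaN with 0, keeping the original column untouched
toy['followers_filled'] = toy['followers_count'].fillna(0)
print(toy)
```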

### 2. Fill values with population mean
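
The same idea works with the mean in place of zero. This is a sketch on a toy data frame with made-up values. `Series.mean()` skips NaN values by default, so the mean is computed from the observed values only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing friend count
toy = pd.DataFrame({'friends_count': [10.0, np.nan, 30.0]})

# Mean of the non-missing values
pop_mean = toy['friends_count'].mean()

# Replace every NaN with that mean, keeping the original column untouched
toy['friends_filled'] = toy['friends_count'].fillna(pop_mean)
print(toy)
```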

### 3. Fill values with grouped mean
This one is a little tougher, since we have to calculate means for each group. We can create a grouped data frame to get those means, then merge them to update the NA values. Let's use `time_zone` as our grouping variable and populate the `favourites_count` with the average within the time zone.
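
Before working through the real data, here is the idea on a toy data frame with made-up values. Note that this sketch uses `groupby(...).transform('mean')` as a compact alternative to the merge approach described above; it is not the method used in the cells that follow, but it should give the same filled values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing favourites_count in the Eastern group
toy = pd.DataFrame({
    'grouped_tz': ['Eastern', 'Eastern', 'Pacific', 'Pacific'],
    'favourites_count': [10.0, np.nan, 4.0, 6.0],
})

# transform('mean') returns the group mean aligned to every original row
group_mean = toy.groupby('grouped_tz')['favourites_count'].transform('mean')

# Fill each NaN with the mean of its own group
toy['favourites_filled'] = toy['favourites_count'].fillna(group_mean)
print(toy)
```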

First, let's take a look at how many unique time_zone values we have:

In [None]:
# use .value_counts to look at unique time_zone counts (dropna=False also counts the missing values)
user_df['time_zone'].value_counts(dropna=False)

Clearly we need to do some clean-up on these time zones. We can create a `grouped_tz` column that combines the similar time zones and/or groups all the stragglers into one value. Rather than overwrite the `time_zone` variable, we'll use this new variable so that none of the data is lost in the process.

In [None]:
# There are a lot of missing values! Let's mark those as missing for now:
user_df.loc[user_df.time_zone.isna(), 'time_zone'] = 'Missing'

# Create a new value for the grouped time zone so we don't lose the original data.
user_df['grouped_tz'] = 'Other'
user_df.loc[user_df.time_zone=='Missing', 'grouped_tz'] = 'Missing'

#Europe
user_df.loc[user_df.time_zone.isin(['London', 'Dublin','Edinburgh','Amsterdam','Stockholm','Lisbon']),
            'grouped_tz'] = 'Europe'
# Eastern & Atlantic (America/Toronto observes Eastern Time)
user_df.loc[user_df.time_zone.isin(['Eastern Time (US & Canada)', 'America/New_York', 'America/Toronto', 'Indiana (East)', 'Atlantic Time (Canada)']),
            'grouped_tz'] = 'Eastern'
# Central
user_df.loc[user_df.time_zone.isin(['Central Time (US & Canada)']),
            'grouped_tz'] = 'Central'
# Pacific, Alaska, Hawaii
user_df.loc[user_df.time_zone.isin(['Pacific Time (US & Canada)', 'America/Los_Angeles','Alaska','Hawaii']),
            'grouped_tz'] = 'Pacific'

In [None]:
# Select only the grouped_tz and favourites_count columns from user_df,
# group by grouped_tz (as_index=False keeps grouped_tz as a regular column),
# and aggregate to get the means
tz_means = user_df[['grouped_tz', 'favourites_count']].groupby('grouped_tz', as_index=False).agg('mean')

# rename the mean column to 'favourites_mean' to avoid a conflict when updating
tz_means.columns = ['grouped_tz', 'favourites_mean']
tz_means.head()

In [None]:
# Merge user_df with tz_means on the grouped_tz value - left outer join style so we don't lose any user_df rows
user_df_merged = user_df.merge(tz_means, how='left', on='grouped_tz')

# Check the results
user_df_merged.head()

In [None]:
# Update the missing favourites_count values with the new favourites_mean:
missing_fav = user_df_merged['favourites_count'].isna()
user_df_merged.loc[missing_fav, 'favourites_count'] = user_df_merged.loc[missing_fav, 'favourites_mean']

### Check the Data

In [None]:
user_df_merged.loc[sample_idx, ['time_zone', 'favourites_mean', 'followers_count', 'friends_count', 'favourites_count']]

## Identifying Outliers
Sometimes we want to identify outliers in a particular data field and review them before including them in the analysis. Similar to the treatment of missing values, we may want to:
1. Remove the outliers, or
2. Cap the outliers at some maximum (or minimum) value

Again, let's look at the `followers_count` and review the percentiles for each of the users in our `user_df` dataframe. We can use the scipy stats module to calculate the percentile for each value, treating the counts as if they were normally distributed, to identify the outliers.

In [None]:
user_df_followers = user_df[['id','followers_count']].sort_values('followers_count')

In [None]:
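# Calculate the mean and standard deviation (NaN values are skipped by default)
fc_mean = user_df_followers['followers_count'].mean()
fc_sd = user_df_followers['followers_count'].std()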
print('mean = {:.2f}\nstandard deviation = {:.2f}'.format(fc_mean, fc_sd))

In [None]:
user_df_followers['percentile'] = stats.norm.cdf(user_df_followers['followers_count'], fc_mean, fc_sd)
user_df_followers.head()

Because the standard deviation is so large, many of the smaller values have cdf values close to 50%. Let's take a look at all of the values over the 95th percentile:

In [None]:
# subset user_df_followers for those rows where the percentile > 0.95.
user_df_followers.loc[user_df_followers['percentile'] > 0.95]

Let's create a new column for a capped `followers_count` so we don't lose the original data. We'll cap the value at the 95th percentile value. We can use the `stats.norm.ppf` function to get the value of the 95th percentile, effectively a reverse cdf calculation.
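
To see how `cdf` and `ppf` fit together, here is a sketch on a toy series with made-up follower counts. It uses `Series.clip` to apply the cap, which is an alternative to a list comprehension and is only one of several ways to do this.

```python
import pandas as pd
from scipy import stats

# Toy data (hypothetical values) with one very large follower count
toy = pd.Series([5.0, 20.0, 35.0, 50.0, 5000.0], name='followers_count')
mu, sd = toy.mean(), toy.std()

# cdf: value -> percentile under a normal distribution with this mean and sd
pct = stats.norm.cdf(toy, mu, sd)

# ppf: percentile -> value (the inverse of cdf)
cap = stats.norm.ppf(0.95, mu, sd)

# Values above the cap are set to the cap; everything else is unchanged
capped = toy.clip(upper=cap)
print(pd.DataFrame({'followers_count': toy, 'percentile': pct, 'capped': capped}))
```

Feeding `cap` back into `stats.norm.cdf` returns 0.95, which is a quick way to confirm that the two functions are inverses.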

In [None]:
capped_value = round(stats.norm.ppf(.95, fc_mean, fc_sd),0)
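# Keep the actual value when it is at or below the cap (NaN stays NaN), otherwise use the cap
user_df_followers['capped_followers'] = [capped_value if v > capped_value else v for v in user_df_followers['followers_count']]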

In [None]:
# Check to see if it worked:
user_df_followers.loc[user_df_followers['percentile']>0.85]