## Checking the data

One of the worst things that can happen in any analysis project is finding that weeks of work are wasted and all the conclusions drawn are false because there was a problem with the data set. For example, it may have been loaded incorrectly, or the client may have made a mistake when putting the data together. Running through the checks described below can save you a lot of effort.

As you proceed in your analysis it is also very useful to have a set of summary statistics against which to sense check the numbers. For example, if you are analysing customer data, then you should always keep in mind, as a benchmark, the total number of unique customers in the data set.
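As a rough illustration of keeping such benchmarks, the sketch below records a few headline figures up front and re-checks them after an aggregation step. The toy data and the column names `customer_id` and `spend` are assumptions made for the example, not columns known to exist in the course data.

```python
import pandas as pd

# Toy stand-in for customer data; `customer_id` and `spend` are hypothetical column names.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 103, 103],
    "spend": [20.0, 15.5, 7.25, 3.0, 12.0, 9.75],
})

# Benchmark figures to record once, before any analysis.
n_rows = len(customers)
n_unique_customers = customers["customer_id"].nunique()
total_spend = customers["spend"].sum()

# Later, after an aggregation step, the figures should still reconcile.
per_customer = customers.groupby("customer_id", as_index=False)["spend"].sum()
assert len(per_customer) == n_unique_customers
assert abs(per_customer["spend"].sum() - total_spend) < 1e-9
```

If either assertion fails after a join or a filter, rows have probably been lost or multiplied somewhere along the way.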

It helps to run through all the things that can go wrong!

Looking first at issues that might occur when loading the data:

1. It might be that the data is incomplete because not all of the rows were loaded.

2. Or it could be that some of the fields were given the wrong data type (e.g. a number is loaded as a text string).

3. Or perhaps the software didn’t understand some of the values and replaced them with NAs.

And then there are problems that might have occurred at the source of the data:

1. There might have been data entry or processing errors that resulted in wrong or missing values in some of the fields.

2. Some of the rows in the data might have been accidentally duplicated (this happens sometimes when querying database tables).

We will address each of these concerns.

Let's begin by loading each of the CSV files again.

In [None]:
import pandas as pd
working_dir = "../data/"

household_size = pd.read_csv(working_dir + 'HouseholdSize.csv', encoding = 'ISO-8859-1')
admin_regions = pd.read_csv(working_dir + 'AdminRegions.csv', encoding = 'ISO-8859-1')
approximated_social_grade = pd.read_csv(working_dir + 'ApproximatedSocialGrade.csv', encoding = 'ISO-8859-1')
country_of_birth = pd.read_csv(working_dir + 'CountryOfBirthDetailed.csv', encoding = 'ISO-8859-1')
customer_data = pd.read_csv(working_dir + 'CustomerData.csv', encoding = 'ISO-8859-1')
postcode_to_ward_code = pd.read_csv(working_dir + 'PostcodeToWardCode.csv', encoding = 'ISO-8859-1')
household_lifestage = pd.read_csv(working_dir + 'HouseholdLifestage.csv', encoding = 'ISO-8859-1')

First let's look at the shape of the `household_size` data frame, that is, the number of rows and columns.

In [None]:
household_size.shape
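One way to get an independent figure to compare the shape with is to count the records in the raw file directly. The following is a minimal sketch that uses an in-memory toy CSV with made-up column names, so that it runs without the course files.

```python
import csv
import io

import pandas as pd

# Toy CSV text standing in for a real file; the column names are illustrative only.
raw_text = "ward_code,one_person,two_people\nE001,10,20\nE002,11,21\nE003,12,22\n"

# Count records with the csv module, which copes with quoted fields containing line breaks.
reader = csv.reader(io.StringIO(raw_text))
header = next(reader)
expected_rows = sum(1 for row in reader if row)  # ignore blank lines, as pandas does by default
expected_cols = len(header)

df = pd.read_csv(io.StringIO(raw_text))
assert df.shape == (expected_rows, expected_cols)
```

For a file on disk, the same idea should work by replacing `io.StringIO(raw_text)` with `open(path, newline='', encoding='ISO-8859-1')`, where `path` is whatever file is being checked.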

This can be checked off against the original CSV file to ensure that all the rows and columns have been loaded. Next let's once again inspect the data using the `head` function.

In [None]:
household_size.head()

There are clearly some issues. For example, row three of `two_people_in_household` contains a text value. Let's investigate the data types that pandas has given to the columns.

In [None]:
household_size.dtypes

We can see that this column has been imported as text (`object` is the dtype pandas typically uses for columns holding strings or mixed values). This needs to be fixed.

In [None]:
# errors='coerce' turns any value that cannot be parsed as a number into NaN
household_size['two_people_in_household'] = pd.to_numeric(household_size['two_people_in_household'], errors='coerce')
household_size.dtypes
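Because `errors='coerce'` replaces unparseable values with NaN without any warning, it can be worth listing those values first, on the raw column, before overwriting it. The sketch below uses a toy series in place of the real column; its values are invented for illustration.

```python
import pandas as pd

# Toy column mimicking a numeric field that was read in as text (values are made up).
raw = pd.Series(["12", "7", "unknown", "30", None], name="two_people_in_household")

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but are NaN after conversion failed to parse.
failed = raw[converted.isna() & raw.notna()]
print(failed.tolist())
```

Inspecting `failed` shows whether the bad entries are placeholders for missing data or numbers with stray formatting that could be cleaned up rather than discarded.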

Now we need to deal with NAs. We can either omit any rows containing them or we can replace them with a value.

In [None]:
household_size = household_size.dropna() #Drop them
#household_size = household_size.fillna(0) #Or replace them
household_size.shape
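Before choosing between the two options, it may help to see how many missing values there are and what each choice does to the numbers. This is a small sketch on invented data, not on the household file itself.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each numeric column (values are made up).
df = pd.DataFrame({
    "ward": ["A", "B", "C", "D"],
    "one_person": [10.0, np.nan, 12.0, 13.0],
    "two_people": [20.0, 21.0, np.nan, 23.0],
})

na_per_column = df.isna().sum()             # missing values in each column
rows_with_na = df.isna().any(axis=1).sum()  # rows that dropna() would remove

dropped = df.dropna()
filled = df.fillna(0)

# Filling with 0 keeps every row but pulls the column mean down.
print(dropped["one_person"].mean(), filled["one_person"].mean())
```

Dropping loses half the rows in this toy case, while filling with zero keeps them but distorts the averages, so the right choice depends on what the missing values actually mean.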

Duplicates are also easily excluded.

In [None]:
household_size = household_size.drop_duplicates()
household_size.shape
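It can also be informative to count the duplicates before removing them, and to check for rows that share a key but disagree on the values, since `drop_duplicates()` with no arguments only removes rows that are identical in every column. The sketch below assumes a hypothetical key column called `ward`.

```python
import pandas as pd

# Toy frame: ward "B" is an exact duplicate, ward "C" appears twice with different values.
df = pd.DataFrame({
    "ward": ["A", "B", "B", "C", "C"],
    "one_person": [10, 11, 11, 12, 99],
})

n_exact = df.duplicated().sum()               # rows identical to an earlier row
n_key = df.duplicated(subset=["ward"]).sum()  # rows whose key has already appeared

# Rows whose key repeats but which are not exact copies of another row.
conflicts = df[df.duplicated(subset=["ward"], keep=False) & ~df.duplicated(keep=False)]

deduped = df.drop_duplicates()
```

Conflicting rows like those for ward "C" survive `drop_duplicates()` and usually need a decision about which record to trust.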

Next we should sense check the data for extreme values. This is done using the `describe` function.

In [None]:
household_size.describe()

By default the describe function only includes numeric variables. Use [the documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) to work out how to include categorical variables.

Finally we will use pickle to save our work.

In [None]:
household_size.to_pickle(working_dir + 'HouseholdSize.p')
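To load the saved frame in a later session, `pd.read_pickle` can be used. The sketch below round-trips a toy frame through a temporary directory; the file name `example.p` is only a placeholder.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data.
df = pd.DataFrame({"ward": ["A", "B"], "two_people": [20.0, 21.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.p")  # placeholder file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame has the same values and the same column types.
assert restored.equals(df)
assert (restored.dtypes == df.dtypes).all()
```

Because the pickle stores the column types along with the values, the type fixes made earlier do not need to be repeated after loading.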

Or we could just write it back as a CSV file.

In [None]:
#household_size.to_csv(working_dir + 'HouseholdSize.csv', index=False) #index=False stops the row index being written as an extra column

Repeat the exercise with the `approximated_social_grade` data frame.