# Cleaning Data in Python

## Exploring Your Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from glob import glob
import re

In [None]:
# Use fully qualified option names; the short forms can match more than one option
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

### Loading and viewing your data

In this chapter, you're going to look at a subset of the Department of Buildings Job Application Filings dataset from the NYC Open Data portal. This dataset consists of job applications filed on January 22, 2017.

Your first task is to load this dataset into a DataFrame and then inspect it using the .head() and .tail() methods. However, you'll find out very quickly that the printed results don't allow you to see everything you need, since there are too many columns. Therefore, you need to look at the data in another way.

The .shape and .columns attributes let you see the shape of the DataFrame and obtain a list of its columns. From here, you can see which columns are relevant to the questions you'd like to ask of the data. To this end, a new DataFrame, df_subset, consisting only of these relevant columns, has been pre-loaded. This is the DataFrame you'll work with in the rest of the chapter.

Get acquainted with the dataset now by exploring it with pandas! This initial exploratory analysis is a crucial first step of data cleaning.

Data can be downloaded from here:
https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2/data

In [None]:
# Read the file into a DataFrame: df
df = pd.read_csv('data/DOB_Job_Application_Filings.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

In [None]:
# Create Subset
df_subset = df.loc[(df['Pre- Filing Date'] == '01/22/2017')].copy()

# Print the head and tail of df_subset
print(df_subset.head())
print(df_subset.tail())

### Further diagnosis

In the previous exercise, you identified some potentially unclean or missing data. Now, you'll continue to diagnose your data with the very useful .info() method.

The .info() method provides important information about a DataFrame, such as the number of rows, number of columns, number of non-missing values in each column, and the data type stored in each column. This is the kind of information that will allow you to confirm whether the 'Initial Cost' and 'Total Est. Fee' columns are numeric or strings. From the results, you'll also be able to see whether or not all columns have complete data in them.

The full DataFrame df and the subset DataFrame df_subset have been pre-loaded. Your task is to use the .info() method on these and analyze the results.

In [None]:
print(len(df))
print(len(df_subset))

# Print the info of df
print(df.info())

# Print the info of df_subset
print(df_subset.info())

Excellent! Notice that the columns 'Initial Cost' and 'Total Est. Fee' are of type object. The currency sign at the beginning of each value in these columns needs to be removed, and the columns need to be converted to numeric. In the full DataFrame, note that there are a lot of missing values. You saw in the previous exercise that there are also a lot of 0 values. Given the amount of data that is missing in the full dataset, it's highly likely that these 0 values represent missing data.

### Calculating summary statistics

You'll now use the .describe() method to calculate summary statistics of your data.

In this exercise, the columns 'Initial Cost' and 'Total Est. Fee' have been cleaned up for you. That is, the dollar sign has been removed and they have been converted into two new numeric columns: initial_cost and total_est_fee. You'll learn how to do this yourself in later chapters. It's also worth noting that some columns such as Job # are encoded as numeric columns, but it does not make sense to compute summary statistics for such columns.

This cleaned DataFrame has been pre-loaded as df. Your job is to use the .describe() method on it and review the resulting summary statistics.

In [None]:
df_subset['Initial Cost'] = df_subset['Initial Cost'].str.replace(',', '')
df_subset['Total Est. Fee'] = df_subset['Total Est. Fee'].str.replace(',', '')

In [None]:
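# regex=False treats '$' as a literal dollar sign rather than a regex anchor
df_subset['Initial Cost'] = df_subset['Initial Cost'].str.replace('$', '', regex=False).astype('float')
df_subset['Total Est. Fee'] = df_subset['Total Est. Fee'].str.replace('$', '', regex=False).astype('float')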

In [None]:
df_subset['Initial Cost']

In [None]:
df_subset['Total Est. Fee']

In [None]:
df_subset.describe()

### Frequency counts for categorical data

As you've seen, .describe() summarizes only numeric columns by default. So how can you diagnose data issues when you have categorical data? One way is by using the .value_counts() method, which returns the frequency counts for each unique value in a column!

This method also has an optional parameter called dropna, which is True by default. This means that missing values in a column are left out of the frequency counts. Set dropna to False so that, if a column has missing values, they are counted as well.

In this exercise, you're going to look at the 'Borough', 'State', and 'Site Fill' columns to make sure all the values in there are valid. When looking at the output, do a sanity check: Are all values in the 'State' column from NY, for example? Since the dataset consists of applications filed in NY, you would expect this to be the case.
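
As a minimal sketch of the dropna behavior described above, consider a small made-up Series (the values are illustrative and are not taken from the DOB dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
states = pd.Series(['NY', 'NY', 'NJ', np.nan, 'NY', np.nan])

# Default (dropna=True): missing values are left out of the counts
counts_default = states.value_counts()

# dropna=False: missing values get their own row in the result
counts_with_nan = states.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

Comparing the two outputs shows how many values would otherwise go unnoticed.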

In [None]:
# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))

In [None]:
# Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))

In [None]:
# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))

Fantastic work! Notice how not all values in the 'State' column are NY. This is an interesting find, as this data is supposed to consist of applications filed in NYC. Curiously, all the 'Borough' values are correct. A good start as to why this may be the case would be to find and look at the codebook for this dataset. Also, for the 'Site Fill' column, you may or may not need to recode the NOT APPLICABLE values to NaN in your final analysis.

### Visualizing single variables with histograms

Up until now, you've been looking at descriptive statistics of your data. One of the best ways to confirm what the numbers are telling you is to plot and visualize the data.

You'll start by visualizing single variables using a histogram for numeric values. The column you will work on in this exercise is 'Existing Zoning Sqft'.

The .plot() method allows you to create a plot of each column of a DataFrame. The kind parameter allows you to specify the type of plot to use - kind='hist', for example, plots a histogram.

In the IPython Shell, begin by computing summary statistics for the 'Existing Zoning Sqft' column using the .describe() method. You'll notice that there are extremely large differences between the min and max values, and the plot will need to be adjusted accordingly. In such cases, it's good to look at the plot on a log scale. The keyword arguments logx=True or logy=True can be passed in to .plot() depending on which axis you want to rescale.

Finally, note that Python will render a plot such that the axis will hold all the information. That is, if you end up with large amounts of whitespace in your plot, it indicates counts or values too small to render.

In [None]:
# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True, figsize=(10, 8), edgecolor='black')

# Display the histogram
plt.show()

Excellent work! While visualizing your data is a great way to understand it, keep in mind that no one technique is better than another. As you saw here, you still needed to look at the summary statistics to help understand your data better. You expected a large number of counts on the left side of the plot because the 25th, 50th, and 75th percentiles have a value of 0. The plot shows us that there are barely any counts near the max value, signifying an outlier.

### Visualizing multiple variables with boxplots

Histograms are great ways of visualizing single variables. To visualize multiple variables, boxplots are useful, especially when one of the variables is categorical.

In this exercise, your job is to use a boxplot to compare the 'initial_cost' across the different values of the 'Borough' column. The pandas .boxplot() method is a quick way to do this, in which you have to specify the column and by parameters. Here, you want to visualize how 'initial_cost' varies by 'Borough'.

pandas and matplotlib.pyplot have been imported for you as pd and plt, respectively, and the DataFrame has been pre-loaded as df.

In [None]:
# Strip thousands separators and the dollar sign in one chain, then convert to float
df['initial_cost'] = df['Initial Cost'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype('float')
df['total_est_fee'] = df['Total Est. Fee'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype('float')

In [None]:
# Create the boxplot
df.boxplot(column='initial_cost', by='Borough', rot=90, figsize=(10, 8))

# Display the plot
plt.show()

Great work! You can see the 2 extreme outliers are in the borough of Manhattan. An initial guess could be that since land in Manhattan is extremely expensive, these outliers may be valid data points. Again, further investigation is needed to determine whether or not you can drop or keep those points in your data.

### Visualizing multiple variables with scatter plots

Boxplots are great when you have a numeric column that you want to compare across different categories. When you want to visualize two numeric columns, scatter plots are ideal.

In this exercise, your job is to make a scatter plot with 'initial_cost' on the x-axis and the 'total_est_fee' on the y-axis. You can do this by using the DataFrame .plot() method with kind='scatter'. You'll notice right away that there are 2 major outliers shown in the plots.

Since these outliers dominate the plot, an additional DataFrame, df_subset, has been provided, in which some of the extreme values have been removed. After making a scatter plot using this, you'll find some interesting patterns here that would not have been seen by looking at summary statistics or 1 variable plots.

In [None]:
# Create and display the first scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70, figsize=(10, 8))
plt.show()

Excellent work! In general, from the second plot it seems like there is a strong correlation between 'initial_cost' and 'total_est_fee'. In addition, take note of the large number of points that have an 'initial_cost' of 0. It is difficult to infer any trends from the first plot because it is dominated by the outliers.

## Tidying Data for Analysis
### Tidy data

* **"Tidy Data"** paper by Hadley Wickham, PhD

#### Principles of tidy data

1. Columns represent separate variables
2. Rows represent individual observations
3. Observational units form tables

#### Converting to tidy data

index | name | treatment a | treatment b
--- | --- | --- | ---
0 | Daniel | - | 42
1 | John | 12 | 31
2 | Jane | 24 | 27


index | name | treatment | value
--- | --- | --- | ---
0 | Daniel | treatment a | -
1 | John | treatment a | 12
2 | Jane | treatment a | 24
3 | Daniel | treatment b | 42
4 | John | treatment b | 31
5 | Jane | treatment b | 27


* Better for reporting vs better for analysis
* Tidy data makes it easier to fix common data problems
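
The conversion between the two tables above can be sketched with pd.melt(). This is only an illustration: the DataFrame name wide is hypothetical, and the "-" entry is assumed to mean a missing value.

```python
import numpy as np
import pandas as pd

# Wide table from the example above; the "-" entry is treated as missing (assumption)
wide = pd.DataFrame({
    'name': ['Daniel', 'John', 'Jane'],
    'treatment a': [np.nan, 12, 24],
    'treatment b': [42, 31, 27],
})

# Keep 'name' fixed and stack the two treatment columns into rows
tidy = pd.melt(wide, id_vars='name', var_name='treatment', value_name='value')
print(tidy)
```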

### Recognizing tidy data

For data to be tidy, it must have:

* Each variable as a separate column.
* Each row as a separate observation.

As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.

In this exercise, two example datasets have been pre-loaded into the DataFrames df1 and df2. Only one of them is tidy. Your job is to explore these further in the IPython Shell and identify the one that is not tidy, and why it is not tidy.

In the rest of this course, you will frequently be asked to explore the structure of DataFrames in the IPython Shell prior to performing different operations on them. Doing this will not only strengthen your comprehension of the data cleaning concepts covered in this course, but will also help you realize and take advantage of the relationship between working in the Shell and in the script.

In [None]:
df1 = pd.read_csv('data/df1_recognizing_tidy_data.csv')

In [None]:
df1.head()

In [None]:
df2 = pd.read_csv('data/df2_recognizing_tidy_data.csv')

In [None]:
# Replace the spreadsheet error marker with a real missing value, not the string 'NaN'
df2['value'] = df2['value'].replace('#NUM!', np.nan)

In [None]:
df2.head()

Exactly! Notice that the variable column of df2 contains the values Solar.R, Ozone, Temp, and Wind. For it to be tidy, these should all be in separate columns, as in df1.

### Reshaping your data using melt

Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: Depending on how your data is represented, you will have to reshape it differently.

In this exercise, you will practice melting a DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep them in their current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.

The (tidy) DataFrame airquality has been pre-loaded. Your job is to melt its Ozone, Solar.R, Wind, and Temp columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.
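
To illustrate the difference between omitting and passing value_vars, here is a small sketch. The two-row DataFrame aq is a hypothetical stand-in with the same column names as airquality, not the pre-loaded data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the airquality data
aq = pd.DataFrame({
    'Ozone': [41.0, 36.0],
    'Solar.R': [190.0, 118.0],
    'Wind': [7.4, 8.0],
    'Temp': [67, 72],
    'Month': [5, 5],
    'Day': [1, 2],
})

# No value_vars: every column not listed in id_vars is melted
melt_all = pd.melt(aq, id_vars=['Month', 'Day'])

# With value_vars: only the listed columns are melted, the others are left out
melt_some = pd.melt(aq, id_vars=['Month', 'Day'], value_vars=['Ozone', 'Wind'])

print(melt_all.shape)
print(melt_some.shape)
```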

In [None]:
# Melt airquality (df1): airquality_melt
airquality_melt = pd.melt(df1, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())

Well done! This exercise demonstrates that melting a DataFrame is not always appropriate if you want to make it tidy. You may have to perform other transformations depending on how your data is represented.

### Customizing melted data

When melting DataFrames, it would be better to have column names more meaningful than variable and value.

The default names may work in certain situations, but it's best to always have data that is self explanatory.

You can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter. You will now practice doing exactly this. The DataFrame airquality has been pre-loaded for you.

In [None]:
# Melt airquality (df1): airquality_melt
airquality_melt = pd.melt(df1, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())

Great work! The DataFrame is more informative now. In the next video, you'll learn about pivoting, which is the opposite of melting. You'll then be able to convert this DataFrame back into its original, tidy, form!

### Pivoting data
#### Pivot: un-melting data

* Opposite of melting
* In melting, we turned columns into rows
* Pivoting: turn unique values into separate columns
* Analysis friendly shape to reporting friendly shape
* Violates tidy data principle: rows contain observations
    * Multiple variables stored in the same column

### Pivot data

Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the .pivot_table() method!

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

.pivot_table() has an index parameter which you can use to specify the columns that you don't want pivoted: It is similar to the id_vars parameter of pd.melt(). Two other parameters that you have to specify are columns (the name of the column you want to pivot), and values (the values to be used when the column is pivoted). The melted DataFrame airquality_melt has been pre-loaded for you.

In [None]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())

Excellent work! Notice that the pivoted DataFrame does not actually look like the original DataFrame. In the next exercise, you'll turn this pivoted DataFrame back into its original form.

### Resetting the index of a DataFrame

After pivoting airquality_melt in the previous exercise, you didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a [hierarchical index (also known as a MultiIndex)](http://pandas.pydata.org/pandas-docs/stable/advanced.html).

Hierarchical indexes are covered in depth in [Manipulating DataFrames with pandas](https://www.datacamp.com/courses/manipulating-dataframes-with-pandas). In essence, they allow you to group columns or rows by another variable - in this case, by 'Month' as well as 'Day'.

There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: .reset_index(). Dan didn't show you how to use this method in the video, but you're now going to practice using it in this exercise to get back the original DataFrame from airquality_pivot, which has been pre-loaded.

In [None]:
# Print the index of airquality_pivot
print(airquality_pivot.index)

In [None]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

In [None]:
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())

### Pivoting duplicate values

So far, you've used the .pivot_table() method when there are multiple index values you want to hold constant during a pivot. In the video, Dan showed you how you can also use pivot tables to deal with duplicate values by providing an aggregation function through the aggfunc parameter. Here, you're going to combine both these uses of pivot tables.

Let's say your data collection method accidentally duplicated your dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as airquality_dup. In addition, the airquality_melt DataFrame from the previous exercise has been pre-loaded. Explore their shapes in the IPython Shell by accessing their .shape attributes to confirm the duplicate rows present in airquality_dup.

You'll see that by using .pivot_table() and the aggfunc parameter, you can not only reshape your data, but also remove duplicates. Finally, you can then flatten the columns of the pivoted DataFrame using .reset_index().
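
A minimal sketch of this idea on a tiny made-up frame follows (the name melted and its values are illustrative). Taking the mean collapses the duplicates without changing the readings only because the duplicated rows are exact copies of each other.

```python
import pandas as pd

# Hypothetical melted data in which every row appears twice
melted = pd.DataFrame({
    'Month': [5, 5, 5, 5],
    'Day': [1, 1, 1, 1],
    'measurement': ['Ozone', 'Wind', 'Ozone', 'Wind'],
    'reading': [41.0, 7.4, 41.0, 7.4],
})

# Pivot and aggregate: one cell per (Month, Day, measurement) combination
pivoted = melted.pivot_table(index=['Month', 'Day'], columns='measurement',
                             values='reading', aggfunc='mean')

# Move Month and Day from the index back into ordinary columns
flat = pivoted.reset_index()
print(flat)
```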

NumPy and pandas have been imported as np and pd respectively.

In [None]:
airquality_dup = pd.read_csv('data/airquality_dup.csv')
airquality_dup.head()

In [None]:
# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc='mean')
airquality_pivot = airquality_pivot.reset_index()
print(airquality_pivot.head())

In [None]:
print(df1.head())

### Beyond melt and pivot
### Splitting a column with .str

The dataset you saw in the video, consisting of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame as tb.

In this exercise, you're going to tidy the 'm014' column, which represents males aged 0-14 years. In order to parse this value, you need to extract the first letter into a new column for gender, and the rest into a column for age_group. Here, since you can parse values by position, you can take advantage of pandas' vectorized string slicing by using the str attribute of columns of type object.

Begin by printing the columns of tb in the IPython Shell using its .columns attribute, and take note of the problematic column.

In [None]:
tb = pd.read_csv('data/tb.csv')
tb.head()

In [None]:
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
print(tb_melt.head())

In [None]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
print(tb_melt.head())

In [None]:
# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
print(tb_melt.head())

Superb! Notice the new 'gender' and 'age_group' columns you created. It is vital to be able to split columns as needed so you can access the data that is relevant to your question.

### Splitting a column with .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. You'll learn how to deal with such cases in this exercise, using a [dataset consisting of Ebola cases and death counts by state and country](https://data.humdata.org/dataset/ebola-cases-2014). It has been pre-loaded into a DataFrame as ebola.

Print the columns of ebola in the IPython Shell using ebola.columns. Notice that the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore _ serves as a delimiter between the first part (cases or deaths), and the second part (country).

This time, you cannot directly slice the variable by position as in the previous exercise. You now need to use Python's built-in string method called .split(). By default, this method will split a string into parts separated by whitespace. However, in this case you want it to split by an underscore. You can do this on the string 'Cases_Guinea', for example, using 'Cases_Guinea'.split('_'), which returns the list ['Cases', 'Guinea'].

The next challenge is to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. You can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the 0 or 1 index, depending on the part you want.
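
The two steps can be sketched on a couple of sample column names before applying them to the melted data. The Series names below is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Plain Python: split one column name on the underscore
parts = 'Cases_Guinea'.split('_')

# pandas: the same split applied to every element of a Series
names = pd.Series(['Cases_Guinea', 'Deaths_Guinea'])
split_names = names.str.split('_')

first = split_names.str.get(0)   # part before the underscore
second = split_names.str.get(1)  # part after the underscore

print(parts)
print(first.tolist(), second.tolist())
```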

In [None]:
ebola = pd.read_csv('data/ebola.csv')
print(ebola.head())

In [None]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
print(ebola_melt.head())

In [None]:
ebola_melt.type_country.unique()

In [None]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')
ebola_melt['str_split'].head()

In [None]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)

In [None]:
# Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)

In [None]:
# Print the head of ebola_melt
print(ebola_melt.head())

## Combining Data for Analysis
### Concatenating Data

### Combining rows of data

The dataset you'll be working with here relates to [NYC Uber data](http://data.beta.nyc/dataset/uber-trip-data-foiled-apr-sep-2014). The original dataset has all the originating Uber pickup locations by time and latitude and longitude. For didactic purposes, you'll be working with a very small portion of the actual data.

Three DataFrames have been pre-loaded: uber1, which contains data for April 2014, uber2, which contains data for May 2014, and uber3, which contains data for June 2014. Your job in this exercise is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months.

Begin by exploring the structure of these three DataFrames in the IPython Shell using methods such as .head().

In [None]:
uber1 = pd.read_csv('data/uber1.csv')
uber2 = pd.read_csv('data/uber2.csv')
uber3 = pd.read_csv('data/uber3.csv')

In [None]:
print(uber1.head())
print(uber2.head())
print(uber3.head())

In [None]:
row_concat = pd.concat([uber1, uber2, uber3])

In [None]:
print(row_concat.shape)
print(row_concat.head())
print(row_concat.tail())

### Combining columns of data

Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.
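
As a small sketch of the two directions, using made-up frames (top, bottom, and side are hypothetical names): with axis=1, rows are matched on their index labels, so the frames should share an index.

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})
side = pd.DataFrame({'c': [9, 10]})

# axis=0 (default): rows of bottom are stacked under rows of top
rows = pd.concat([top, bottom])

# axis=1: columns of side are placed next to columns of top, aligned on the index
cols = pd.concat([top, side], axis=1)

print(rows)
print(cols)
```

Note that the row-wise result keeps each frame's original index labels, so labels can repeat.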

You'll return to the Ebola dataset you worked with briefly in the last chapter. It has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the status and country of a patient is contained in a single column. This column has been parsed into a new DataFrame, status_country, where there are separate columns for status and country.

Explore the ebola_melt and status_country DataFrames in the IPython Shell. Your job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

In [None]:
status_country = ebola_melt[['type', 'country']].copy()

In [None]:
ebola_melt_2 = ebola_melt[['Date', 'Day', 'type_country', 'counts']].copy()

In [None]:
print(status_country.head())
print(ebola_melt_2.head())

In [None]:
ebola_tidy = pd.concat([ebola_melt_2, status_country], axis=1)

In [None]:
print(ebola_tidy.head())

### Finding and Concatenating Data

### Finding files that match a pattern

You're now going to practice using the glob module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.

As Dan showed you in the video, the glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if you know the file names follow the pattern part_, then a single character, then .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.).

Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'. The ? wildcard represents any 1 character, and the * wildcard represents any number of characters.
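
The difference between the two wildcards can be checked with a short, self-contained sketch. The file names are made up and are created in a temporary directory purely for illustration.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty, hypothetical files to match against
    for name in ['part_1.csv', 'part_2.csv', 'part_10.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    # '?' matches exactly one character, so part_10.csv is not matched
    single = [os.path.basename(p) for p in glob(os.path.join(tmp, 'part_?.csv'))]

    # '*' matches any number of characters, so every .csv file is matched
    all_csv = [os.path.basename(p) for p in glob(os.path.join(tmp, '*.csv'))]

print(sorted(single))
print(sorted(all_csv))
```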

In [None]:
csv_files = glob('data/uber?.csv')

In [None]:
print(csv_files)

### Iterating and concatenating all matches

Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called list_data. Your job is to use a for loop to iterate through each of the filenames, read each filename into a DataFrame, and then append it to list_data.

You can then concatenate this list of DataFrames using pd.concat(). Go for it!

In [None]:
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)

In [None]:
uber_new = pd.concat(list_data)

In [None]:
print(uber_new.head())

### Merge Data

```python
pd.merge(left=state_populations, right=state_codes, on=None, left_on='state', right_on='name')
```

Use the 'on' parameter instead when the key column has the same name in both DataFrames.
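
A minimal sketch of both cases follows. The frames reuse the names from the call above, but their contents (including the placeholder population numbers and the code column) are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the population figures are placeholders, not real values
state_populations = pd.DataFrame({'state': ['California', 'Texas'],
                                  'population': [100, 200]})
state_codes = pd.DataFrame({'name': ['California', 'Texas'],
                            'code': ['CA', 'TX']})

# Key columns are named differently: use left_on / right_on
merged_diff = pd.merge(left=state_populations, right=state_codes,
                       left_on='state', right_on='name')

# Key columns share a name: a single 'on' is enough
state_codes_renamed = state_codes.rename(columns={'name': 'state'})
merged_same = pd.merge(left=state_populations, right=state_codes_renamed, on='state')

print(merged_diff)
print(merged_same)
```

With left_on and right_on, both key columns are kept in the result; with on, the shared key appears only once.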

**Types of Merges**
* One-to-one
* Many-to-one / One-to-many
* Many-to-many

### 1-to-1 data merge

Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition towards Antarctica. The dataset was taken from a sqlite database from the [Software Carpentry SQL lesson](http://swcarpentry.github.io/sql-novice-survey/).

Two DataFrames have been pre-loaded: site and visited. Explore them in the IPython Shell and take note of their structure and column names. Your task is to perform a 1-to-1 merge of these two DataFrames using the 'name' column of site and the 'site' column of visited.

In [None]:
site = pd.DataFrame(data=[['DR-1', -49.85, -128.57],
                          ['DR-3', -47.15, -126.72],
                          ['MSK-4', -48.87, -123.40]],
                    index=range(0,3), columns=['name', 'lat', 'long'])

In [None]:
site

In [None]:
visited = pd.DataFrame(data=[[619, 'DR-1', '1927-02-08'],
                             [734, 'DR-3', '1939-01-07'],
                             [837, 'MSK-4', '1932-01-14']],
                       index=range(0,3), columns=['ident', 'site', 'dated'])

In [None]:
visited

In [None]:
o2o = pd.merge(left=site, right=visited, on=None, left_on='name', right_on='site')

In [None]:
o2o

Superb! Notice the 1-to-1 correspondence between the name column of the site DataFrame and the site column of the visited DataFrame. This is what made the 1-to-1 merge possible.

### Many-to-1 data merge

In a many-to-one (or one-to-many) merge, one of the values will be duplicated and recycled in the output. That is, one of the keys in the merge is not unique.

Here, the two DataFrames site and visited have been pre-loaded once again. Note that this time, visited has multiple entries for the site column. Confirm this by exploring it in the IPython Shell.

The pd.merge() call is the same as in the 1-to-1 merge from the previous exercise, but the data and output will be different.

In [None]:
visited_new = pd.DataFrame(data=[[622, 'DR-1', '1927-02-10'],
                                 [735, 'DR-3', '1930-01-12'],
                                 [751, 'DR-3', '1930-02-26'],
                                 [752, 'DR-3', 'NaN'],
                                 [844, 'DR-1', '1932-03-22']],
                           index=range(0,5), columns=['ident', 'site', 'dated'])

In [None]:
visited_new

In [None]:
visited = pd.concat([visited, visited_new]).reset_index(drop=True)

In [None]:
visited

In [None]:
m2o = pd.merge(left=site, right=visited, on=None, left_on='name', right_on='site')

In [None]:
m2o

### Many-to-many data merge

The final merging scenario occurs when neither DataFrame has unique keys for the merge. In this case, every pairwise combination of rows is created for each duplicated key.

Two example DataFrames that share common key values have been pre-loaded: df1 and df2. Another DataFrame df3, which is the result of df1 merged with df2, has been pre-loaded. All three DataFrames have been printed - look at the output and notice how pairwise combinations have been created. This example is to help you develop your intuition for many-to-many merges.

Here, you'll work with the site and visited DataFrames from before, and a new survey DataFrame. Your task is to merge site and visited as you did in the earlier exercises. You will then merge this merged DataFrame with survey.

Begin by exploring the site, visited, and survey DataFrames in the IPython Shell.
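
To see the pairwise combinations directly, here is a small sketch with two made-up frames that share a non-unique key column c1 (left and right are hypothetical names). Each key appears twice on both sides, so the merge yields 2 x 2 = 4 rows per key.

```python
import pandas as pd

left = pd.DataFrame({'c1': ['a', 'a', 'b', 'b'], 'c2': [1, 2, 3, 4]})
right = pd.DataFrame({'c1': ['a', 'a', 'b', 'b'], 'c2': [10, 20, 30, 40]})

# Overlapping non-key columns get the default suffixes _x and _y
merged = pd.merge(left, right, on='c1')
print(merged)
```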

In [None]:
df1 = pd.DataFrame(data=[['a', 1], ['a', 2], ['b', 3], ['b', 4]], index=range(0,4), columns=['c1', 'c2'])

In [None]:
df2 = pd.DataFrame(data=[['a', 10], ['a', 20], ['b', 30], ['b', 40]], index=range(0,4), columns=['c1', 'c2'])

In [None]:
df3 = pd.DataFrame(data=[['a', 1, 10], ['a', 1, 20], ['a', 2, 10], ['a', 2, 20], ['b', 3, 30], ['b', 3, 40], ['b', 4, 30], ['b', 4, 40]], index=range(0,8), columns=['c1', 'c2_x', 'c2_y'])

In [None]:
survey = pd.read_csv('data/survey.csv')
survey.head()

In [None]:
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')
m2m

In [None]:
# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')
m2m.head()

## Cleaning Data for Analysis
### Data Types

#### Converting data types

In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage.

The [tips dataset](https://github.com/mwaskom/seaborn-data/blob/master/tips.csv) has been loaded into a DataFrame called tips. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

Look at the output of tips.info() in the IPython Shell. You'll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type category and note the reduced memory usage.

In [None]:
path_to_data = r'E:\Users\Trenton J. McKinney\PycharmProjects\DataCamp\DataCamp-master\10-cleaning-data-in-python\_datasets\tips.csv'

In [None]:
tips = pd.read_csv(path_to_data)

In [None]:
tips.sex = tips.sex.astype('category')

In [None]:
tips.smoker = tips.smoker.astype('category')

In [None]:
# .info() prints its report directly and returns None, so no print() is needed
tips.info()

In [None]:
tips.head()

Excellent! By converting sex and smoker to categorical variables, the memory usage of the DataFrame went down from 13.4 KB to 10.1 KB. This may seem like a small difference here, but when you're dealing with large datasets, the reduction in memory usage can be very significant!
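
To measure the effect directly rather than reading it off .info(), here is a minimal sketch. It uses a small made-up DataFrame (not the real tips data) and assumes you want deep=True, which asks pandas to count the memory held by the strings themselves; the names df, before, and after are illustrative.

```python
import pandas as pd

# Made-up data standing in for the tips columns; not the real dataset
df = pd.DataFrame({
    'sex': ['Female', 'Male'] * 500,
    'smoker': ['No', 'Yes'] * 500,
})

# deep=True includes the memory used by the string values themselves
before = df.memory_usage(deep=True).sum()

# Convert both repetitive string columns to the category dtype
df['sex'] = df['sex'].astype('category')
df['smoker'] = df['smoker'].astype('category')
after = df.memory_usage(deep=True).sum()

print(before, after)
```

The savings come from storing each distinct label once and keeping only a small integer code per row, so the benefit is largest when a column has few unique values relative to its length.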

#### Working with numeric data

If you expect the data type of a column to be numeric (int or float), but instead it is of type object, this typically means that there is a non-numeric value in the column, which is also a sign of bad data.

You can use the pd.to_numeric() function to convert a column into a numeric data type. If the function raises an error, you can be sure that there is a bad value within the column. You can either use the techniques you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore or coerce the value into a missing value, NaN.

A modified version of the tips dataset has been pre-loaded into a DataFrame called tips. For instructional purposes, it has been pre-processed to introduce some 'bad' data for you to clean. Use the .info() method to explore this. You'll note that the total_bill and tip columns, which should be numeric, are instead of type object. Your job is to fix this.

In [None]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

In [None]:
# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

In [None]:
# Print the info of tips (.info() prints directly and returns None)
tips.info()

Great work! The 'total_bill' and 'tip' columns in this DataFrame are stored as object types because the string 'missing' is used in these columns to encode missing values. By coercing the values into a numeric type, they become proper NaN values.
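
The two behaviors of pd.to_numeric() described above can be seen in isolation with a small sketch. The Series below is invented for illustration, with the string 'missing' playing the role of the bad entry; raw, raised, and coerced are hypothetical names.

```python
import pandas as pd

# Invented values; 'missing' plays the role of the bad entry
raw = pd.Series(['16.99', 'missing', '10.34'])

# With the default errors='raise', a bad value stops the conversion
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as exc:
    raised = True
    print('Conversion failed:', exc)

# With errors='coerce', values that cannot be parsed become NaN
coerced = pd.to_numeric(raw, errors='coerce')
print(coerced)

# Show the original entries that ended up as NaN
print(raw[coerced.isna()])
```

Indexing the original column with coerced.isna() is one way to locate the offending entries before deciding how to handle them. Note that it would also select values that were already missing before the conversion.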

### Using Regular Expressions to Clean Strings

![alt text](images/regex_example.JPG "Regex Example")

```python
import re

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r'\$\d*\.\d{2}')
result = pattern.match('$17.89')
bool(result)
```

#### String parsing with regular expressions

In the video, Dan introduced you to the basics of regular expressions, which are powerful ways of defining patterns to match strings. This exercise will get you started with writing them.

When working with data, it is sometimes necessary to write a regular expression to check that values were entered properly. Phone numbers are a common field that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern xxx-xxx-xxxx.

The [regular expression module](https://docs.python.org/3/library/re.html) in Python is re. When performing pattern matching on data, since the pattern will be used for a match across multiple rows, it's better to compile the pattern first using re.compile(), and then use the compiled pattern to match values.

* Import re.
* Compile a pattern that matches a phone number of the format xxx-xxx-xxxx.
    * Use \d{x} to match x digits. Here you'll need to use it three times: twice to match 3 digits, and once to match 4 digits.
    * Place the regular expression inside re.compile().
* Using the .match() method on prog, check whether the pattern matches the string '123-456-7890'.
* Using the same approach, now check whether the pattern matches the string '1123-456-7890'.

In [None]:
# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

In [None]:
# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

In [None]:
# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))
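
One caveat worth seeing in code: .match() only anchors the pattern at the start of the string, so extra characters at the end do not prevent a match. The sketch below uses a hypothetical input, too_long, and compares .match() with .fullmatch(), which requires the whole string to fit the pattern.

```python
import re

prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# Hypothetical input with one digit too many at the end
too_long = '123-456-78901'

# match() succeeds because the first 12 characters fit the pattern
print(bool(prog.match(too_long)))

# fullmatch() fails because the trailing '1' is left over
print(bool(prog.fullmatch(too_long)))

# A correctly formatted number passes fullmatch()
print(bool(prog.fullmatch('123-456-7890')))
```

If the goal is to validate an entire field rather than find a prefix, .fullmatch() (or adding a $ anchor to the pattern) may be the safer choice.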

#### Extracting numerical values from strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: You pass in a pattern and a string to re.findall(), and it will return a list of the matches.

* Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
    * Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
    * \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
* Print the matches to confirm that your regular expression found the values 10 and 1.

In [None]:
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

In [None]:
# Print the matches
print(matches)
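
Note that re.findall() returns the matches as strings, not numbers. A minimal sketch of converting them for later arithmetic follows; the names text and counts are introduced here for illustration.

```python
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# findall returns a list of strings, one per match
matches = re.findall(r'\d+', text)

# Convert each matched string to an integer before doing arithmetic
counts = [int(m) for m in matches]

print(matches)
print(counts)
```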

#### Pattern matching

In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

* Write patterns to match:
    * A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
    * A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
        * Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \\. to match the decimal point, and \d{x} to match x number of digits.
    * A capital letter, followed by an arbitrary number of alphanumeric characters.
        * Use [A-Z] to match any capital letter followed by \w* to match an arbitrary number of alphanumeric characters.

In [None]:
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

In [None]:
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

In [None]:
# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)
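
In practice these patterns are usually applied to a whole column rather than a single string. The sketch below shows one way to do that with a compiled pattern and Series.apply(); the amounts Series and its values are invented for illustration, and using .fullmatch() instead of .match() is a choice made here so that the entire value must fit the pattern.

```python
import re
import pandas as pd

# Hypothetical column of amounts; the values are invented for illustration
amounts = pd.Series(['$123.45', '$7.5', '12.00', '$0.99'])

# Compile once, then reuse the compiled pattern for every row
dollar = re.compile(r'\$\d*\.\d{2}')

# Flag each row as valid only if the whole string fits the pattern
is_valid = amounts.apply(lambda value: bool(dollar.fullmatch(value)))

print(is_valid.tolist())
```

The resulting boolean Series can then be used to filter the rows that need attention, for example with amounts[~is_valid].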

### Using Functions to Clean Data
#### Custom Functions to Clean Data