# Data Cleaning Walkthrough

## Introduction

At many points in your career, you'll need to be able to build complete, end-to-end data science projects on your own. **Data science projects usually consist of one of two things:**

- An exploration and analysis of a set of data. One example might involve analyzing donors to political campaigns, creating a plot, and then sharing an analysis of the plot with others.
- An operational system that generates predictions based on data that updates continually. An algorithm that pulls in daily stock ticker data and predicts which stock prices will rise and fall would be one example.

You'll find the ability to create data science projects useful in several different contexts:

- Projects will help you build a portfolio, which is critical to finding a job as a data analyst or scientist.
- Working on projects will help you learn new skills and reinforce existing concepts.
- Most "real-world" data science and analysis work consists of developing internal projects.
- Projects allow you to investigate interesting phenomena and satisfy your curiosity.

Whether you aim to become a data scientist or analyst or you're just curious about the world, building projects can be immensely rewarding.

**In this section, we'll walk through the first part of a complete data science project, including how to acquire the raw data.** The project will focus on exploring and analyzing a data set. We'll develop our **data cleaning** and **storytelling skills**, which will enable us to build complete projects on our own.

We'll focus **primarily on data exploration** in this section. We'll also combine several messy data sets into a single clean one to make analysis easier. Over the next few missions, we'll work through the rest of our project and perform the actual analysis.

**The first step in creating a project is to decide on a topic**. You want the topic to be something you're interested in and motivated to explore. It's very obvious when people are making projects just to make them, rather than out of a genuine interest in the topic.

Here are two ways to go about finding a good topic:

- Think about what sectors or angles you're really interested in, then find data sets relating to those sectors.
- Review several data sets, and find one that seems interesting enough to explore.

Whichever approach you take, you can start your search at these sites:

- [Data.gov](https://www.data.gov/) - A directory of government data downloads
- [/r/datasets](https://reddit.com/r/datasets) - A subreddit that has hundreds of interesting data sets
- [Awesome datasets](https://github.com/caesar0301/awesome-public-datasets) - A list of data sets hosted on GitHub
- [rs.io](http://rs.io/100-interesting-data-sets-for-statistics/) - A great blog post with hundreds of interesting data sets

In real-world data science, you may not find an ideal data set. You might have to aggregate disparate data sources instead, or do a good amount of data cleaning.

For the purposes of this project, we'll be using data about New York City public schools, which can be found [here](https://data.cityofnewyork.us/browse?category=Education).

## Finding All of the Relevant Data Sets

Once you've chosen a topic, you'll want to pick an angle to investigate. It's important to choose an angle that has enough depth to analyze, but isn't so complicated that it's difficult to get started. You want to finish the project, and you want your results to be interesting to others.

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests, and whether they're unfair to certain groups. Given our prior knowledge of this topic, investigating the correlations between [SAT scores](https://en.wikipedia.org/wiki/SAT) and demographics might be an interesting angle to take. We could correlate SAT scores with factors like race, gender, income, and more.

The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's fairly important to perform well on it.

The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the data set for our project is based on 2,400 total points). **Organizations often rank high schools by their average SAT scores.** The scores are also considered a measure of overall school district quality.

New York City makes its data on [high school SAT scores](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4) available online, as well as the [demographics for each high school](https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2014-2015/n3p6-zve2). The first few rows of the SAT data look like this:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1_1f_s0jbjSH1YeWs-W3Ioy8mqDqaYYNb"></left>

Unfortunately, combining both of the data sets won't give us all of the demographic information we want to use. We'll need to supplement our data with other sources to do our full analysis.

The same website has several related data sets covering demographic information and test scores. Here are the links to all of the data sets we'll be using:

- [SAT scores by school](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4) - SAT scores for each high school in New York City
- [School attendance](https://data.cityofnewyork.us/Education/School-Attendance-and-Enrollment-Statistics-by-Dis/7z8d-msnt) - Attendance information for each school in New York City
- [Class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3) - Information on class size for each school
- [AP test results](https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e) - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
- [Graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a) - The percentage of students who graduated, and other outcome information
- [Demographics](https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j) - Demographic information for each school
- [School survey](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) - Surveys of parents, teachers, and students at each school

All of these data sets are interrelated. We'll need to combine them into a single data set before we can find correlations.

## Finding Background Information

Before we move into coding, we'll need to do some background research. A thorough understanding of the data will help us avoid costly mistakes, such as thinking that a column represents something other than what it does. Background research will also give us a better understanding of how to combine and analyze the data.

In this case, we'll want to research:

- [New York City](https://en.wikipedia.org/wiki/New_York_City)
- [The SAT](https://en.wikipedia.org/wiki/SAT)
- [Schools in New York City](https://en.wikipedia.org/wiki/List_of_high_schools_in_New_York_City)
- [Our data](https://data.cityofnewyork.us/browse?category=Education)

We can learn a few different things from these resources. For example:

- Only high school students take the SAT, so we'll want to focus on high schools.
- New York City is made up of five boroughs, which are essentially distinct regions.
- New York City schools fall within several different school districts, each of which can contain dozens of schools.
- Our data sets include several different types of schools. We'll need to clean them so that we can focus on high schools only.
- Each school in New York City has a unique code called a **DBN**, or district borough number.
- Aggregating data by district will allow us to use the district mapping data to plot district-by-district differences.
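
As a rough illustration of the DBN structure noted above, here is a minimal sketch that splits a DBN into its parts. The layout (two-digit district, one-letter borough code, then a school number) is an assumption inferred from sample values such as **01M292** that appear in the data, and **split_dbn()** is a hypothetical helper, not part of the project code.

```python
def split_dbn(dbn):
    """Split a DBN such as '01M292' into (district, borough, school number).

    Assumes the layout: two-digit district + one-letter borough + school number.
    """
    district = dbn[:2]        # zero-padded district number
    borough = dbn[2]          # single-letter borough code
    school_number = dbn[3:]   # remaining characters identify the school
    return district, borough, school_number


print(split_dbn("01M292"))  # ('01', 'M', '292')
```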

## Reading in the Data

Once we've done our background research, we're ready to read in the data. For your convenience, we've placed all the data into the **datasets** folder. Here are all of the files in the folder:

- **ap_2010.csv** - [Data on AP test results](https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e)
- **class_size.csv** - Data on [class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3)
- **demographics.csv** - Data on [demographics](https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j)
- **graduation.csv** - Data on [graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a)
- **hs_directory.csv** - [A directory of high schools](https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2014-2015/n3p6-zve2)
- **sat_results.csv** - Data on [SAT scores](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4)
- **survey_all.txt** - Data on [surveys](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) from all schools
- **survey_d75.txt** - Data on surveys from New York City district 75

**survey_all.txt** and **survey_d75.txt** are in more complicated formats than the other files. For now, we'll focus on reading in the CSV files only, and then explore them.

We'll read each file into a [pandas dataframe](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), and then store all of the dataframes in a dictionary. This will give us a convenient way to store them, and a quick way to reference them later on.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Read each of the files in the list **data_files** into a pandas dataframe using the **pandas.read_csv()** function.
- Recall that all of the data sets are in the **datasets** folder. That means the path to **ap_2010.csv** is **datasets/ap_2010.csv**.
- Add each of the dataframes to the dictionary **data**, using the base of the filename as the key. For example, you'd enter **ap_2010** for the file **ap_2010.csv**.
- Afterwards, **data** should have the following keys:
    - **ap_2010**
    - **class_size**
    - **demographics**
    - **graduation**
    - **hs_directory**
    - **sat_results**
- In addition, each key in **data** should have the corresponding dataframe as its value.


In [13]:
import pandas as pd

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}
for filename in data_files:
    df = pd.read_csv("datasets/" + filename)
    # Use the file name without its ".csv" extension as the dictionary key.
    name = filename[:-4]
    data[name] = df
    print(df.shape)


(258, 5)
(27611, 16)
(10075, 38)
(25096, 23)
(435, 64)
(478, 6)


## Exploring the SAT Data

What we're mainly interested in is the SAT data set, which corresponds to the dictionary key **sat_results**. This data set contains the **SAT scores** for each high school in New York City. We eventually want to correlate selected information from this data set with information in the other data sets.

Let's explore **sat_results** to see what we can discover. Exploring the dataframe will help us understand the structure of the data, and make it easier for us to analyze it.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Display the first five rows of the SAT scores data.
    - Use the key **sat_results** to access the SAT scores dataframe stored in the dictionary **data.**
    - Use the **pandas.DataFrame.head()** method along with the **print()** function to display the first five rows of the dataframe.

In [19]:
data['sat_results'].head()

| | DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score |
|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355 | 404 | 363 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383 | 423 | 366 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377 | 402 | 370 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414 | 401 | 359 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390 | 433 | 384 |


## Exploring the Remaining Data

When we printed the first five rows of the SAT data, the output looked like this:

```python
DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL
2  01M450                     EAST SIDE COMMUNITY SCHOOL
3  01M458                      FORSYTH SATELLITE ACADEMY
4  01M509                        MARTA VALLE HIGH SCHOOL

  Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score  \
0                     29                             355                 404
1                     91                             383                 423
2                     70                             377                 402
3                      7                             414                 401
4                     44                             390                 433

  SAT Writing Avg. Score
0                    363
1                    366
2                    370
3                    359
4                    384
```

We can make a few observations based on this output:

- The **DBN** appears to be a unique ID for each school.
- We can tell from the first few rows of names that we only have data about high schools.
- There's only a single row for each high school, so each **DBN** is unique in the SAT data.
- We may eventually want to combine the three columns that contain SAT scores -- **SAT Critical Reading Avg. Score**, **SAT Math Avg. Score**, and **SAT Writing Avg. Score** -- into a single column to make the scores easier to analyze.

Given these observations, let's explore the other data sets to see if we can gain any insight into how to combine them.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Loop through each **key** in **data**. For each key:
    - Display the first five rows of the dataframe associated with the **key**.

In [29]:
for key in data:
    print(data[key].head())

      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  
   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT   
2    1

## Reading in the Survey Data


We can make some observations based on the first few rows of each data set.

- Each data set appears to either have a **DBN** column, or the information we need to create one. That means we can use a **DBN** column to combine the data sets. First we'll pinpoint matching rows from different data sets by looking for identical **DBNs**, then group all of their columns together in a single data set.
- Some fields look interesting for mapping -- particularly **Location 1**, which contains coordinates inside a larger string.
- Some of the data sets appear to contain multiple rows for each school (because the rows have duplicate **DBN** values). That means we’ll have to do some preprocessing to ensure that each **DBN** is unique within each data set. If we don't do this, we'll run into problems when we combine the data sets, because we might be merging two rows in one data set with one row in another data set.
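
To make the duplicate-**DBN** problem concrete, here is a small sketch on toy data. The dataframes, the **GRADE** values, and the use of **merge()** are illustrative assumptions; they are not the project's actual combining step.

```python
import pandas as pd

# Toy data: the "sizes" frame has two rows for the same DBN.
scores = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122, 1172]})
sizes = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "GRADE": ["09", "10", "09"],
})

print(scores["DBN"].is_unique)           # True
print(sizes["DBN"].is_unique)            # False
print(sizes["DBN"].duplicated().sum())   # 1

# Merging on DBN repeats the 01M292 score row once per matching row in "sizes".
merged = scores.merge(sizes, on="DBN")
print(len(merged))  # 3
```

Checking **is_unique** on each data set's **DBN** column before combining is one quick way to see which data sets need preprocessing.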

Before we proceed with the merge, we should make sure we have all of the data we want to unify. We mentioned the survey data earlier (**survey_all.txt** and **survey_d75.txt**), but we didn't read those files in because they're in a slightly more complex format.

Each survey text file looks like this:

```python
dbn bn  schoolname  d75 studentssurveyed    highschool  schooltype  rr_s
"01M015"    "M015"  "P.S. 015 Roberto Clemente" 0   "No"    0   "Elementary School"     88
```

The files are tab delimited and encoded with **Windows-1252** encoding. An encoding defines how a computer stores the contents of a file in binary. The most common encodings are **UTF-8** and **ASCII**. **Windows-1252** is rarely used, and can cause errors if we read such a file in without specifying the encoding. If you'd like to read more about encodings, [here's](http://kunststube.net/encoding/) a good primer.

We'll need to pass the encoding and delimiter to the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to ensure it reads the surveys in properly.
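
The sketch below shows the idea on a throwaway file rather than the real survey data. The file name **sample_survey.txt** and its contents (including the school name) are made up for illustration; only the **delimiter** and **encoding** arguments mirror what the text describes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: a tab-delimited file containing a non-ASCII character.
text = 'dbn\tschoolname\n"01M015"\t"Café Academy"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_survey.txt")
    # Write the file using Windows-1252, as the survey files are encoded.
    with open(path, "w", encoding="windows-1252", newline="") as f:
        f.write(text)

    # Tell pandas both the delimiter and the encoding.
    sample = pd.read_csv(path, delimiter="\t", encoding="windows-1252")

print(sample.loc[0, "schoolname"])  # Café Academy
```

In Windows-1252 the character é is stored as a single byte that is not valid UTF-8 on its own, which is why reading such a file without the **encoding** argument can fail.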

After we read in the survey data, we'll want to combine it into a single dataframe. We can do this by calling the [pandas.concat()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function:

```python
z = pd.concat([x,y], axis=0)
```

The code above will combine dataframes x and y by essentially appending y to the end of x. The combined dataframe z will have the number of rows in x plus the number of rows in y.
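
Here is a small, self-contained version of the same call on toy dataframes. The values (and the DBN **75K004**) are made up for illustration.

```python
import pandas as pd

x = pd.DataFrame({"dbn": ["01M015", "01M019"], "rr_s": [88, 90]})
y = pd.DataFrame({"dbn": ["75K004"], "rr_s": [70]})

# Append the rows of y to the end of x.
z = pd.concat([x, y], axis=0)
print(z.shape)        # (3, 2)
print(list(z.index))  # [0, 1, 0]

# Optionally renumber the rows so the index is unique.
z_reset = pd.concat([x, y], axis=0, ignore_index=True)
print(list(z_reset.index))  # [0, 1, 2]
```

Note that by default each dataframe keeps its original index labels, so the combined index can contain repeats. Passing **ignore_index=True** is one way to avoid that if it matters for later steps.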

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Read in **survey_all.txt.**
    - Use the **pandas.read_csv()** function to read **survey_all.txt** into the variable **all_survey**. Recall that this file is located in the **datasets** folder.
    - Specify the keyword argument **delimiter="\t".**
    - Specify the keyword argument **encoding="windows-1252".**
- Read in **survey_d75.txt.**
    - Use the **pandas.read_csv()** function to read **datasets/survey_d75.txt** into the variable **d75_survey**. Recall that this file is located in the **datasets** folder.
    - Specify the keyword argument **delimiter="\t".**
    - Specify the keyword argument **encoding="windows-1252".**
- Combine **d75_survey** and **all_survey** into a single dataframe.
    - Use the pandas [concat()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function with the keyword argument **axis=0** to combine **d75_survey** and **all_survey** into the dataframe **survey**.
    - Pass in **all_survey** first, then **d75_survey** when calling the **pandas.concat()** function.
- Display the first five rows of **survey** using the **pandas.DataFrame.head()** method.



In [45]:
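all_survey = pd.read_csv("datasets/survey_all.txt", delimiter="\t", encoding="windows-1252")
d75_survey = pd.read_csv("datasets/survey_d75.txt", delimiter="\t", encoding="windows-1252")

# Stack the two surveys row-wise; sort=True sorts the columns when the two frames' columns don't already line up.
survey = pd.concat([all_survey, d75_survey], axis=0, sort=True)
survey.head()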

## Cleaning Up the Surveys

In the last step, the expected output was:

```python
    N_p  N_s  N_t  aca_p_11  aca_s_11  aca_t_11  aca_tot_11    bn  com_p_11  \
0   90  NaN   22       7.8       NaN       7.9         7.9  M015       7.6   
1  161  NaN   34       7.8       NaN       9.1         8.4  M019       7.6
```

There are two immediate facts that we can see in the data:

- There are over **2000** columns, nearly all of which we don't need. We'll have to filter the data to remove the unnecessary ones. Working with fewer columns will make it easier to print the dataframe out and find correlations within it.
- The survey data has a **dbn** column that we'll want to convert to uppercase (**DBN**). The conversion will make the column name consistent with the other data sets.

First, we'll need to filter the columns to remove the ones we don't need. Luckily, there's a data dictionary at the [original data download](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) location. The dictionary tells us what each column represents. Based on our knowledge of the problem and the analysis we're trying to do, we can use the data dictionary to determine which columns to use.

Here's a preview of the data dictionary:


<left><img width="800" src="https://drive.google.com/uc?export=view&id=145GxwJiDOodRgyXfV15ybUBqfbDLFwEU"></left>


Based on the dictionary, it looks like these are the relevant columns:

```python
["dbn", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", 
 "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", 
 "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11",
 "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
```

These columns will give us aggregate survey data about how parents, teachers, and students feel about school safety, academic performance, and more. They will also give us the **DBN**, which allows us to uniquely identify the school.

Before we filter columns out, we'll want to copy the data from the **dbn** column into a new column called **DBN**. We can copy columns like this:

```python
survey["new_column"] = survey["old_column"]
```
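
As a quick sketch of the copy-then-filter pattern on toy data (the dataframe and its **unused** column are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "rr_s": [88, 90],
    "unused": ["a", "b"],
})

toy["DBN"] = toy["dbn"]   # copy the column under the new name
keep = ["DBN", "rr_s"]
toy = toy.loc[:, keep]    # keep only the listed columns, in that order

print(list(toy.columns))  # ['DBN', 'rr_s']
```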

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Copy the data from the **dbn** column of **survey** into a new column in **survey** called **DBN**.
- Filter **survey** so it only contains the columns we listed above. You can do this using **pandas.DataFrame.loc[]**.
    - Remember that we copied **dbn** into **DBN**; be sure to change the list of columns we want to keep accordingly.
- Assign the dataframe **survey** to the key **survey** in the dictionary **data**.
- When you're finished, the value in **data["survey"]** should be a dataframe with 23 columns and 1702 rows.


In [46]:
names = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", 
 "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", 
 "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11",
 "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

In [53]:
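survey["DBN"] = survey["dbn"]
print(survey.shape)

# Keep only the columns listed in `names`, then store the result in `data`.
survey = survey[names]
data["survey"] = survey

print(survey.shape)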

(1702, 2774)
(1702, 23)


## Inserting DBN Fields

When we explored all of the data sets, we noticed that some of them, like **class_size** and **hs_directory**, don't have a DBN column. **hs_directory** does have a **dbn** column, though, so we can just rename it.

However, **class_size** doesn't appear to have the column at all. Here are the first few rows of the data set:

```python
    CSD BOROUGH   SCHOOL CODE         SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED
1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT
2    1       M        M015  P.S. 015 Roberto Clemente     01       GEN ED
3    1       M        M015  P.S. 015 Roberto Clemente     01          CTT
4    1       M        M015  P.S. 015 Roberto Clemente     02       GEN ED
```

Here are the first few rows of the **sat_results** data, which does have a **DBN** column:

```python
DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL
2  01M450                     EAST SIDE COMMUNITY SCHOOL
3  01M458                      FORSYTH SATELLITE ACADEMY
4  01M509                        MARTA VALLE HIGH SCHOOL
```

From looking at these rows, we can tell that the **DBN** in the **sat_results** data is just a combination of the **CSD** and **SCHOOL CODE** columns in the **class_size** data. The main difference is that the **DBN** is padded, so that the **CSD** portion of it always consists of two digits. That means we'll need to add a leading 0 to the **CSD** if the **CSD** is less than two digits long. Here's a diagram illustrating what we need to do:

<center><img width="150" src="https://drive.google.com/uc?export=view&id=1k1QuSctrJnMW6gfAbTs-DJbzTe23en2U"></center>

As you can see, whenever the **CSD** is less than two digits long, we need to add a leading 0. We can accomplish this using the **pandas.Series.apply()** method, along with a custom function that:

- Takes in a number.
- Converts the number to a string using the **str()** function.
- Checks the length of the string using the **len()** function.
    - If the string is two digits long, returns the string.
    - If the string is one digit long, adds a 0 to the front of the string, then returns it.
        - You can use the string method [zfill()](https://docs.python.org/3/library/stdtypes.html#str.zfill) to do this.

Once we've padded the **CSD**, we can use the addition operator (+) to combine the values in the **CSD** and **SCHOOL CODE** columns. Here's an example of how we would do this:

```python
dataframe["new_column"] = dataframe["column_one"] + dataframe["column_two"]
```

And here's a diagram illustrating the basic concept:

<center><img width="300" src="https://drive.google.com/uc?export=view&id=1Io7o-45Pixlv2tX8rTyZvL6KjIq9qMPz"></center>
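
For comparison, the same padding and concatenation can also be done without a custom function, using pandas' vectorized string methods. This is an alternative sketch on toy data, not the approach the exercise asks for; every school code here other than **M015** is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "CSD": [1, 9, 10, 32],
    "SCHOOL CODE": ["M015", "K021", "X045", "K556"],
})

# Convert the numbers to strings, then left-pad each one with zeros to width 2.
toy["padded_csd"] = toy["CSD"].astype(str).str.zfill(2)

# String columns can be concatenated element-wise with +.
toy["DBN"] = toy["padded_csd"] + toy["SCHOOL CODE"]

print(toy["DBN"].tolist())  # ['01M015', '09K021', '10X045', '32K556']
```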


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Copy the **dbn** column in **hs_directory** into a new column called **DBN**.
- Create a new column called **padded_csd** in the **class_size** data set.
    - Use the [pandas.Series.apply()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) method along with the function **pad_csd()** to generate this column.
        - Make sure to apply the function along the **data["class_size"]["CSD"]** column.
- Use the addition operator (+) along with the **padded_csd** and **SCHOOL CODE** columns of **class_size**, then assign the result to the **DBN** column of **class_size**.
- Display the first few rows of **class_size** to double check the **DBN** column.

In [51]:
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

In [56]:
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return string_representation.zfill(2)

In [65]:
data['class_size']['padded_csd'] = data['class_size']['CSD'].apply(pad_csd)

In [69]:
data['class_size']['DBN'] = data['class_size']['padded_csd'] + data['class_size']['SCHOOL CODE']

In [71]:
data['class_size']['DBN'].head(5)

0    01M015
1    01M015
2    01M015
3    01M015
4    01M015
Name: DBN, dtype: object

## Combining the SAT Scores


Now we're almost ready to combine our data sets. Before we do, let's take some time to calculate variables that will be useful in our analysis. We've already discussed one such variable -- a column that totals up the SAT scores for the different sections of the exam. This will make it much easier to correlate scores with demographic factors because we'll be working with a single number, rather than three different ones.

Before we can generate this column, we'll need to convert the **SAT Math Avg. Score**, **SAT Critical Reading Avg. Score**, and **SAT Writing Avg. Score** columns in the **sat_results** data set from the object (string) data type to a numeric data type. We can use the [pandas.to_numeric()](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_numeric.html) function for the conversion. If we don't convert the values, we won't be able to add the columns together.

It's important to pass the keyword argument **errors="coerce"** when we call **pandas.to_numeric()**, so that pandas treats any strings it can't convert to numbers as missing values instead of raising an error.

After we perform the conversion, we can use the addition operator (+) to add all three columns together.
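
Here is a minimal sketch of the conversion and the sum on toy data. The short column names are hypothetical, and the non-numeric placeholder **s** is borrowed from the AP data shown earlier; whether the SAT columns use the same placeholder is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "math": ["404", "423", "s"],
    "reading": ["355", "383", "s"],
})

# Convert each column; strings that aren't numbers become NaN.
for col in ["math", "reading"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Adding columns works element-wise; NaN in any input gives NaN in the total.
toy["total"] = toy["math"] + toy["reading"]

print(toy["total"].tolist())  # [759.0, 806.0, nan]
```

Because missing values propagate through the addition, any school missing a section score ends up with a missing combined score as well.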


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Convert the **SAT Math Avg. Score**, **SAT Critical Reading Avg. Score**, and **SAT Writing Avg. Score** columns in the **sat_results** data set from the object (string) data type to a numeric data type.
    - Use the **pandas.to_numeric()** function on each of the columns, and assign the result back to the same column.
    - Pass in the keyword argument **errors="coerce".**
- Create a column called **sat_score** in **sat_results** that holds the combined SAT score for each school.
    - Add up **SAT Math Avg. Score**, **SAT Critical Reading Avg. Score**, and **SAT Writing Avg. Score**, and assign the total to the **sat_score** column of **sat_results.**
- Display the first few rows of the **sat_score** column of **sat_results** to verify that everything went okay.

In [76]:
data['sat_results']['SAT Math Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Math Avg. Score'], errors="coerce")
data['sat_results']['SAT Critical Reading Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Critical Reading Avg. Score'], errors="coerce")
data['sat_results']['SAT Writing Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Writing Avg. Score'], errors="coerce")

In [78]:
data['sat_results']['sat_score'] = data['sat_results']['SAT Math Avg. Score'] + \
                        data['sat_results']['SAT Critical Reading Avg. Score'] + \
                        data['sat_results']['SAT Writing Avg. Score']

In [94]:
data['sat_results'][data['sat_results']['SAT Writing Avg. Score'].isnull()].shape

(57, 7)

In [95]:
data['sat_results']['sat_score'].head()

0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64

## Parsing Geographic Coordinates for Schools

Next, we'll want to parse the **latitude** and **longitude** coordinates for each school. This will enable us to map the schools and uncover any geographic patterns in the data. The coordinates are currently in the text field **Location 1** in the **hs_directory** data set.

Let's take a look at the first few rows:

```python
0    883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1    1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2    1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3    411 Pearl Street\nNew York, NY 10038\n(40.7106...
4    160-20 Goethals Avenue\nJamaica, NY 11432\n(40...
```        
                                               
As you can see, this field contains a lot of information we don't need. We want to extract the coordinates, which are in parentheses at the end of the field. Here's an example:

```python
1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)
```

We want to extract the **latitude**, 40.8276026690005, and the **longitude**, -73.90447525699966. Taken together, **latitude** and **longitude** make up a pair of coordinates that allows us to pinpoint any location on Earth.

We can do the extraction with a **regular expression**. The following expression will pull out everything inside the parentheses:

```python
import re
re.findall(r"\(.+\)", "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)")
```

This command will return a one-item list containing the string (40.8276026690005, -73.90447525699966). We'll need to process this result further using the string methods [split()](https://docs.python.org/3/library/stdtypes.html#str.split) and [replace()](https://docs.python.org/3/library/stdtypes.html#str.replace) to extract each coordinate.
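
One possible way to do that further processing is sketched below, using the example address from above. The variable names are illustrative, and this is only one of several reasonable ways to combine **replace()** and **split()**.

```python
import re

text = "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)"

# findall returns a list; take the first (and only) match.
coords = re.findall(r"\(.+\)", text)[0]

# Strip the parentheses, then split on the comma between the two values.
parts = coords.replace("(", "").replace(")", "").split(",")
lat = parts[0].strip()
lon = parts[1].strip()

print(lat)  # 40.8276026690005
print(lon)  # -73.90447525699966
```

Both values are still strings at this point, so they would need a numeric conversion before being used for mapping.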

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Write a function that:
    - Takes in a string
    - Uses the regular expression above to extract the coordinates
    - Uses string manipulation functions to pull out the latitude
    - Returns the latitude
- Use the [Series.apply()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) method to apply the function across the **Location 1** column of **hs_directory**. Assign the result to the **lat** column of **hs_directory.**
- Display the first few rows of **hs_directory** to verify the results.

In [109]:
import re

def get_lat(text):
    # Grab the "(lat, lon)" substring, then drop the parentheses and split on the comma.
    lat_long = re.findall(r"\(.+\)", text)[0]
    return lat_long[1:-1].split(',')[0]

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(get_lat)

In [110]:
data["hs_directory"]["lat"]

0       40.67029890700047
1        40.8276026690005
2      40.842414068000494
3       40.71067947100045
4      40.718810094000446
5      40.840513977000455
6       40.71196311300048
7       40.73248537800049
8      40.713577459000476
9        40.6978073300005
10      40.83695342600049
11      40.71067947100045
12      40.86604554900049
13      40.74218869400045
14      40.88005009300048
15      40.84887878800049
16      40.64866366300049
17      40.77429641100048
18      40.81113885600047
19      40.82230376500047
20     40.840513977000455
21      40.69717472700046
22      40.59865238600048
23     40.713577459000476
24      40.81091785300049
25      40.86001222100049
26      40.77429641100048
27      40.63490809500047
28      40.69693735800047
29      40.59359381100046
              ...        
405     40.74509351900048
406     40.82117077000049
407     40.73551946300046
408      40.6595170060005
409     40.75239242400045
410    40.528228767000485
411    40.760414170000445
412    40.76

## Extracting the Longitude

On the last screen, we parsed the **latitude** from the **Location 1** column. Now we'll just need to do the same for the longitude.

Once we have both coordinates, we'll need to convert them to numeric values. We can use the **pandas.to_numeric()** function to convert them from strings to numbers.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Write a function that:
    - Takes in a string.
    - Uses the regular expression above to extract the coordinates.
    - Uses string manipulation functions to pull out the longitude.
    - Returns the longitude.
- Use the **Series.apply()** method to apply the function across the **Location 1** column of **hs_directory**. Assign the result to the **lon** column of **hs_directory.**
- Use the **to_numeric()** function to convert the **lat** and **lon** columns of **hs_directory** to numbers.
    - Specify the **errors="coerce"** keyword argument to handle missing values properly.
- Display the first few rows of **hs_directory** to verify the results.

In [111]:
import re

def get_lon(text):
    lat_long = re.findall(r"\(.+\)", text)[0]
    # The longitude follows ", ", so strip the leading space.
    return lat_long[1:-1].split(',')[1].strip()

data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(get_lon)
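
The conversion step can be sketched on a small stand-in dataframe. The `coords` frame below is hypothetical (only the second row's values come from the example above), and it assumes pandas is imported as `pd`.

```python
import pandas as pd

# Hypothetical stand-in for the lat and lon columns of hs_directory.
coords = pd.DataFrame({
    "lat": ["40.8276026690005", "40.5", None],
    "lon": [" -73.90447525699966", "-73.5", "not available"],
})

for col in ["lat", "lon"]:
    # Strip stray whitespace, then convert; values that can't be parsed become NaN.
    coords[col] = pd.to_numeric(coords[col].str.strip(), errors="coerce")

print(coords.dtypes)
```

With **errors="coerce"**, a value like "not available" turns into **NaN** instead of raising an exception.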

## Next Steps

We're almost ready to combine our data sets! We've come a long way in this mission -- we've gone from choosing a topic for a project to acquiring the data to having clean data that we're almost ready to combine.

Along the way, we've learned how to:

- Handle files with different formats and columns
- Prepare to merge multiple files
- Use text processing to extract coordinates from a string
- Convert columns from strings to numbers

You'll always learn something new while working on a real-world data science project. Each project is unique, and there will always be quirks you don't quite know how to handle. The key is to be willing to try different approaches, and to have a general framework in your head for how to move from Step A to Step B.

In the next section, we'll finish cleaning the data sets, then combine them so we can start our analysis.

# Data Cleaning Walkthrough: Combining the Data

## Introduction

In the last section, we began investigating possible relationships between **SAT scores** and **demographic factors**. In order to do this, we acquired several data sets about [New York City public schools](https://data.cityofnewyork.us/data?cat=education). We manipulated these data sets, and found that we could combine them all using the **DBN** column. All of the data sets are currently stored in the **data** dictionary, with each data set's name as its **key**. Each individual data set is a pandas dataframe.

In this section, **we'll clean the data a bit more**, then **combine** it. Finally, we'll **compute correlations** and perform some analysis.

The first thing we'll need to do in preparation for the merge is condense some of the data sets. In the last section, we noticed that the values in the **DBN** column were unique in the **sat_results** data set. Other data sets like **class_size** had duplicate **DBN** values, however.

We'll need to condense these data sets so that each value in the **DBN** column is unique. If not, we'll run into issues when it comes time to combine the data sets.

While the main data set we want to analyze, **sat_results**, has unique **DBN** values for every high school in New York City, other data sets aren't as clean. A single row in the **sat_results** data set may match multiple rows in the **class_size** data set, for example. This situation will create problems, because we don't know which of the multiple entries in the **class_size** data set we should combine with the single matching entry in **sat_results**. Here's a diagram that illustrates the problem:


<left><img width="400" src="https://drive.google.com/uc?export=view&id=1deYm5RdQXO2xMX6dUgHLvqDEWipk3axq"></left>

In the diagram above, we can't just combine the rows from both data sets because there are several cases where multiple rows in **class_size** match a single row in **sat_results.**

To resolve this issue, we'll condense the **class_size**, **graduation**, and **demographics** data sets so that each **DBN** is unique.
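
One way to check whether a data set needs condensing is to test its **DBN** column for duplicates. The sketch below uses two toy dataframes with made-up values; only the column names mirror the real data.

```python
import pandas as pd

# Toy frames with made-up numbers.
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448"],
    "sat_score": [1100.0, 1200.0],
})
class_size = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "AVERAGE CLASS SIZE": [21.0, 26.3, 22.0],
})

# is_unique is True only when no value repeats.
print(sat_results["DBN"].is_unique)
print(class_size["DBN"].is_unique)

# Count the rows per DBN and keep the DBNs that appear more than once.
dbn_counts = class_size["DBN"].value_counts()
print(dbn_counts[dbn_counts > 1])
```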

## Condensing the Class Size Data Set

The first data set that we'll condense is **class_size**. The first few rows of **class_size** look like this:

|   | CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME               | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS |
|---|-----|---------|-------------|---------------------------|-------|--------------|--------------------------------------|-------------------------------------|-----------------------------|-----------------------------------|--------------------|
| 0 | 1   | M       | M015        | P.S. 015 Roberto Clemente | 0K    | GEN ED       | -                                    | -                                   | -                           | 19.0                              | 1.0                |
| 1 | 1   | M       | M015        | P.S. 015 Roberto Clemente | 0K    | CTT          | -                                    | -                                   | -                           | 21.0                              | 1.0                |
| 2 | 1   | M       | M015        | P.S. 015 Roberto Clemente | 01    | GEN ED       | -                                    | -                                   | -                           | 17.0                              | 1.0                |
| 3 | 1   | M       | M015        | P.S. 015 Roberto Clemente | 01    | CTT          | -                                    | -                                   | -                           | 17.0                              | 1.0                |
| 4 | 1   | M       | M015        | P.S. 015 Roberto Clemente | 02    | GEN ED       | -                                    | -                                   | -                           | 15.0                              | 1.0                |

As you can see, the first few rows all pertain to the same school, which is why the **DBN** appears more than once. It looks like each school has multiple values for **GRADE**, **PROGRAM TYPE**, **CORE SUBJECT (MS CORE and 9-12 ONLY)**, and **CORE COURSE (MS CORE and 9-12 ONLY)**.

If we look at the unique values for **GRADE**, we get the following:

```python
array(['0K', '01', '02', '03', '04', '05', '0K-09', nan, '06', '07', '08',
       'MS Core', '09-12', '09'], dtype=object)
```

Because we're dealing with high schools, we're only concerned with grades 9 through 12. That means we only want to pick rows where the value in the **GRADE** column is **09-12**.

If we look at the unique values for **PROGRAM TYPE**, we get the following:

```python
array(['GEN ED', 'CTT', 'SPEC ED', nan, 'G&T'], dtype=object)
```

Each school can have multiple program types. Because **GEN ED** is the largest category by far, let's only select rows where **PROGRAM TYPE** is **GEN ED**.



## Condensing the Class Size Data Set

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


- Create a new variable called **class_size**, and assign the value of **data["class_size"]** to it.
- Filter **class_size** so the **GRADE** column only contains the value **09-12.** Note that the name of the **GRADE** column has a space at the end; you'll generate an error if you don't include it.
- Filter **class_size** so that the **PROGRAM TYPE** column only contains the value **GEN ED.**
- Display the first five rows of **class_size** to verify.

In [125]:
class_size = data["class_size"]

class_size = class_size[class_size["GRADE "] == "09-12"]

class_size = class_size[class_size["PROGRAM TYPE"] == 'GEN ED']

class_size.head(5)

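|     | CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | padded_csd | DBN |
|-----|-----|---------|-------------|-------------|-------|--------------|--------------------------------------|-------------------------------------|-----------------------------|-----------------------------------|--------------------|--------------------|------------------------|-----------------------|-------------|--------------------------------|------------|--------|
| 225 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 9 | - | 63.0 | 3.0 | 21.0 | 19.0 | 25.0 | STARS | NaN | 1 | 01M292 |
| 226 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 10 | - | 79.0 | 3.0 | 26.3 | 24.0 | 31.0 | STARS | NaN | 1 | 01M292 |
| 227 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 11 | - | 38.0 | 2.0 | 19.0 | 16.0 | 22.0 | STARS | NaN | 1 | 01M292 |
| 228 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 12 | - | 69.0 | 3.0 | 23.0 | 13.0 | 30.0 | STARS | NaN | 1 | 01M292 |
| 229 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | MATH | Integrated Algebra | - | 53.0 | 3.0 | 17.7 | 16.0 | 21.0 | STARS | NaN | 1 | 01M292 |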


## Computing Average Class Sizes

As we saw when we displayed **class_size** on the last screen, **DBN** still isn't completely unique. This is due to the **CORE COURSE (MS CORE and 9-12 ONLY)** and **CORE SUBJECT (MS CORE and 9-12 ONLY)** columns.

**CORE COURSE (MS CORE and 9-12 ONLY)** and **CORE SUBJECT (MS CORE and 9-12 ONLY)** seem to pertain to different kinds of classes. For example, here are the unique values for **CORE SUBJECT (MS CORE and 9-12 ONLY)**:

```python
array(['ENGLISH', 'MATH', 'SCIENCE', 'SOCIAL STUDIES'], dtype=object)
```

This column only seems to include certain subjects. We want our class size data to include every single class a school offers -- not just a subset of them. What we can do is take the average across all of the classes a school offers. This will give us unique **DBN** values, while also incorporating as much data as possible into the average.

Fortunately, we can use the [pandas.DataFrame.groupby()](http://pandas.pydata.org/pandas-docs/stable/groupby.html) method to help us with this. The **DataFrame.groupby()** method will split a dataframe up into unique groups, based on a given column. We can then use the **agg()** method on the resulting **pandas.core.groupby** object to find the **mean** of each column.

Let's say we have this data set:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1sJjENlTRR56RwYzBmmsU8aIMELgjx8zg"></left>

Using the **groupby()** method, we'll split this dataframe into four separate groups -- one with the **DBN 01M292**, one with the **DBN 01M332**, one with the **DBN 01M378**, and one with the **DBN 01M448**:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1y9imbMLKRDI50wQqPn7P6TAd6MfCL4Nq"></left>

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1FitnyClxHDQLnoAB3jR7YI_jEPZZhkco"></left>

Then, we can compute the averages for the **AVERAGE CLASS SIZE** column in each of the four groups using the **agg()** method:

<left><img width="200" src="https://drive.google.com/uc?export=view&id=1gHVZixGOuGYYON_zU0OUPTJcC9Q_mKeV"></left>

After we group a dataframe and aggregate data based on it, the column we performed the grouping on (in this case **DBN**) will become the index, and will no longer appear as a column in the data itself. To undo this change and keep **DBN** as a column, we'll need to use [pandas.DataFrame.reset_index()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html). This method will reset the index to a list of integers and make **DBN** a column again.
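
The group, aggregate, and reset sequence can be sketched on a toy dataframe. The class sizes below are made up, and the sketch averages a single numeric column rather than the whole data set.

```python
import pandas as pd

# Toy data with made-up class sizes for two DBNs.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M332", "01M332"],
    "CORE SUBJECT": ["ENGLISH", "MATH", "ENGLISH", "MATH"],
    "AVERAGE CLASS SIZE": [20.0, 24.0, 21.0, 25.0],
})

# Group by DBN and average the numeric column of interest.
averages = toy.groupby("DBN")[["AVERAGE CLASS SIZE"]].agg("mean")
print(averages.index.name)  # DBN is now the index, not a column

# reset_index() turns DBN back into an ordinary column.
averages = averages.reset_index()
print(averages)
```

After the reset, each **DBN** appears exactly once, which is what we need before merging.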

## Computing Average Class Sizes

- Find the average values for each column associated with each **DBN** in **class_size**.
    - Use the [pandas.DataFrame.groupby()](http://pandas.pydata.org/pandas-docs/stable/groupby.html) method to group **class_size** by **DBN**.
    - Use the [agg()](http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation) method on the resulting **pandas.core.groupby** object, along with the **numpy.mean()** function as an argument, to calculate the average of each group.
    - Assign the result back to **class_size**.
- Reset the index to make **DBN** a column again.
    - Use the [pandas.DataFrame.reset_index()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) method, along with the keyword argument **inplace=True**.
- Assign **class_size** back to the **class_size** key of the **data** dictionary.
- Display the first few rows of **data["class_size"]** to verify that everything went okay.

In [129]:
import numpy

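# Average the columns within each DBN group. On recent pandas versions, non-numeric
# columns may need to be excluded first, e.g. with groupby("DBN").mean(numeric_only=True).
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)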

data["class_size"] = class_size
data["class_size"].head()

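|   | DBN    | CSD | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | SCHOOLWIDE PUPIL-TEACHER RATIO |
|---|--------|-----|-----------------------------------|--------------------|--------------------|------------------------|-----------------------|--------------------------------|
| 0 | 01M292 | 1   | 88.0                              | 4.0                | 22.564286          | 18.5                   | 26.571429             | NaN                            |
| 1 | 01M332 | 1   | 46.0                              | 2.0                | 22.0               | 21.0                   | 23.5                  | NaN                            |
| 2 | 01M378 | 1   | 33.0                              | 1.0                | 33.0               | 33.0                   | 33.0                  | NaN                            |
| 3 | 01M448 | 1   | 105.6875                          | 4.75               | 22.23125           | 18.25                  | 27.0625               | NaN                            |
| 4 | 01M450 | 1   | 57.6                              | 2.733333           | 21.2               | 19.4                   | 22.866667             | NaN                            |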


## Condensing the Demographics Data Set

Now that we've finished condensing **class_size**, let's condense **demographics**. The first few rows look like this:

|   | DBN    | Name                      | schoolyear | fl_percent | frl_percent | total_enrollment | prek | k  | grade1 | grade2 |
|---|--------|---------------------------|------------|------------|-------------|------------------|------|----|--------|--------|
| 0 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20052006   | 89.4       | NaN         | 281              | 15   | 36 | 40     | 33     |
| 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20062007   | 89.4       | NaN         | 243              | 15   | 29 | 39     | 38     |
| 2 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20072008   | 89.4       | NaN         | 261              | 18   | 43 | 39     | 36     |
| 3 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20082009   | 89.4       | NaN         | 252              | 17   | 37 | 44     | 32     |
| 4 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20092010   |  _          | 96.5        | 208              | 16   | 40 | 28     | 32     |

In this case, the only column that prevents a given **DBN** from being unique is **schoolyear**. We only want to select rows where **schoolyear** is **20112012**. This will give us the most recent year of data, and also match our SAT results data.
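
A small sketch of this kind of filter, using a toy dataframe with made-up enrollment numbers. It assumes **schoolyear** holds integers, so the comparison uses an integer rather than a string.

```python
import pandas as pd

# Toy data; the enrollment figures and the second DBN are made up.
demographics = pd.DataFrame({
    "DBN": ["01M015", "01M015", "01M019"],
    "schoolyear": [20102011, 20112012, 20112012],
    "total_enrollment": [203, 189, 328],
})

# Keep only the rows for the 2011-2012 school year.
recent = demographics[demographics["schoolyear"] == 20112012]
print(recent)
print(recent["DBN"].is_unique)
```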

## Condensing the Demographics Data Set

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Filter **demographics**, only selecting rows in **data["demographics"]** where **schoolyear** is **20112012.**
    - **schoolyear** is actually an integer, so be careful about how you perform your comparison.
- Display the first few rows of **data["demographics"]** to verify that the filtering worked.

In [63]:
# put your code here

## Condensing the Graduation Data Set

Finally, we'll need to condense the **graduation** data set. Here are the first few rows:

|   | Demographic  | DBN    | School Name                           | Cohort   | Total Cohort | Total Grads - n |
|---|--------------|--------|---------------------------------------|----------|--------------|-----------------|
| 0 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2003     | 5            | s               |
| 1 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2004     | 55           | 37              |
| 2 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2005     | 64           | 43              |
| 3 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2006     | 78           | 43              |
| 4 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2006 Aug | 78           | 44              |

The **Demographic** and **Cohort** columns are what prevent **DBN** from being unique in the **graduation** data. A **Cohort** appears to refer to the year the data represents, and the **Demographic** appears to refer to a specific demographic group. In this case, we want to pick data from the most recent Cohort available, which is 2006. We also want data from the full cohort, so we'll only pick rows where **Demographic** is **Total Cohort**.
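
The two conditions can also be combined into a single boolean mask. This is only a sketch on toy data: the "Male" row is a hypothetical demographic group, and **Cohort** is treated as text because it contains values like "2006 Aug".

```python
import pandas as pd

# Toy data modeled on the table above; the "Male" row is hypothetical.
graduation = pd.DataFrame({
    "Demographic": ["Total Cohort", "Total Cohort", "Total Cohort", "Male"],
    "DBN": ["01M292", "01M292", "01M292", "01M292"],
    "Cohort": ["2005", "2006", "2006 Aug", "2006"],
})

# Each comparison yields a boolean Series; & keeps rows where both are True.
mask = (graduation["Cohort"] == "2006") & (graduation["Demographic"] == "Total Cohort")
filtered = graduation[mask]
print(filtered)
```

Note the parentheses around each comparison; without them, Python's operator precedence applies **&** before **==**.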

## Condensing the Graduation Data Set

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Filter **graduation**, only selecting rows where the **Cohort** column equals **2006.**
- Filter **graduation**, only selecting rows where the **Demographic** column equals **Total Cohort**.
- Display the first few rows of **data["graduation"]** to verify that everything worked properly.

In [68]:
# put your code here

## Converting AP Test Scores

We're almost ready to combine all of the data sets. The only remaining thing to do is convert the [Advanced Placement (AP)](https://en.wikipedia.org/wiki/Advanced_Placement_exams) test scores from strings to numeric values. High school students take the AP exams before applying to college. There are several AP exams, each corresponding to a school subject. High school students who earn high scores may receive college credit.

AP exams have a 1 to 5 scale; 3 or higher is a passing score. Many high school students take AP exams -- particularly those who attend academically challenging institutions. AP exams are much more rare in schools that lack funding or academic rigor.

It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools. To determine this, we'll need to convert the AP exam scores in the **ap_2010** data set to numeric values first.

There are three columns we'll need to convert:

- **AP Test Takers** (note that there's a trailing space in the column name)
- **Total Exams Taken**
- **Number of Exams with scores 3 4 or 5**
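
A sketch of the conversion on a toy version of **ap_2010** follows. The rows are made up, and using "s" as a placeholder for an unavailable value is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical rows; "s" stands in for a value that can't be parsed as a number.
ap_2010 = pd.DataFrame({
    "DBN": ["01M448", "01M450", "01M515"],
    "AP Test Takers ": ["39", "19", "s"],
    "Total Exams Taken": ["49", "21", "s"],
    "Number of Exams with scores 3 4 or 5": ["10", "s", "s"],
})

cols = ["AP Test Takers ", "Total Exams Taken", "Number of Exams with scores 3 4 or 5"]
for col in cols:
    # errors="coerce" turns unparseable strings into NaN instead of raising an error.
    ap_2010[col] = pd.to_numeric(ap_2010[col], errors="coerce")

print(ap_2010.dtypes)
```

Because **NaN** is a floating-point value, columns that contain it end up with a float dtype even when the valid entries are whole numbers.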

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Convert each of the following columns in **ap_2010** to numeric values using the [pandas.to_numeric()](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_numeric.html) function with the keyword argument **errors="coerce".**
    - **AP Test Takers**
    - **Total Exams Taken**
    - **Number of Exams with scores 3 4 or 5**
- Display the column types using the **dtypes** attribute.

In [70]:
# put your code here

## Left, Right, Inner, and Outer Joins

Before we merge our data, we'll need to decide on the merge strategy we want to use. We'll be using the [pandas.DataFrame.merge()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) method, which supports four types of joins -- **left**, **right**, **inner**, and **outer**. Each of these join types dictates how pandas combines the rows.

We'll be using the **DBN** column to identify matching rows across data sets. In other words, the values in that column will help us know which row from the first data set to combine with which row in the second data set.

There may be **DBN** values that exist in one data set but not in another. This is partly because the data is from different years. Each data set also has inconsistencies in terms of how it was gathered. Human error (and other types of errors) may also play a role. Therefore, we may not find matches for the **DBN** values in **sat_results** in all of the other data sets, and other data sets may have **DBN** values that don't exist in **sat_results**.

We'll merge two data sets at a time. For example, we'll merge **sat_results** and **hs_directory**, then merge the result with **ap_2010**, then merge the result of that with **class_size**. We'll continue combining data sets in this way until we've merged all of them. Afterwards, we'll have roughly the same number of rows, but each row will have columns from all of the data sets.

The merge strategy we pick will affect the number of rows we end up with. Let's take a look at each strategy.

Let's say we're merging the following two data sets:

<left><img width="300" src="https://drive.google.com/uc?export=view&id=1Vlypix_SIkxCdRS0ABvO4tGiuvFLg321"></left>

With an **inner merge**, we'd only combine rows where the same **DBN** exists in both data sets. We'd end up with this result:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1LR4c8louX-JAZFYta_Y99FLsGkCf9grr"></left>

With a **left merge**, we'd only use **DBN** values from the dataframe on the "left" of the merge. In this case, **sat_results** is on the left. Some of the DBNs in **sat_results** don't exist in **class_size**, though. The merge will handle this by assigning null values to the **class_size** columns for any rows in **sat_results** that don't have corresponding data in **class_size.**

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1hPoJ5wLECEzz25jrTP5bw9oZ0eerNi2p"></left>

With a **right merge**, we'll only use **DBN** values from the dataframe on the "right" of the merge. In this case, **class_size** is on the right:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1YYdf4iEMtHYqRBMTEFcfuTyu9zdFlnx7"></left>

With an **outer merge**, we'll take any **DBN** values from either **sat_results** or **class_size**:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1sl5wCK3WZ3lTzJm8JUn-bg4MoXl3xSe9"></left>

As you can see, each merge strategy has its advantages. Depending on the strategy we choose, we may preserve rows at the expense of having more missing column data, or minimize missing data at the expense of having fewer rows. Choosing a merge strategy is an important decision; it's worth thinking about your data carefully, and what trade-offs you're willing to make.
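
To make the trade-off concrete, here's a small sketch that runs all four strategies on two toy dataframes and reports the row count and the number of null cells for each. The values are made up; only the column names mirror the real data.

```python
import pandas as pd

# Two DBNs appear in both frames; each frame also has one DBN the other lacks.
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448", "01M450"],
    "sat_score": [1100.0, 1200.0, 1300.0],
})
class_size = pd.DataFrame({
    "DBN": ["01M292", "01M448", "01M509"],
    "AVERAGE CLASS SIZE": [22.0, 23.0, 24.0],
})

row_counts = {}
null_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = sat_results.merge(class_size, on="DBN", how=how)
    row_counts[how] = len(merged)
    null_counts[how] = int(merged.isnull().sum().sum())
    print(how, row_counts[how], null_counts[how])
```

In this toy case the inner merge keeps the fewest rows and has no nulls, while the outer merge keeps the most rows and has the most nulls.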

Because this project is concerned with determining demographic factors that correlate with SAT score, we'll want to preserve as many rows as possible from **sat_results** while minimizing null values.

This means that we may need to use different merge strategies with different data sets. Some of the data sets have a lot of missing **DBN** values. This makes a **left** join more appropriate, because we don't want to lose too many rows when we merge. If we did an **inner** join, we would lose the data for many high schools.

Some data sets have **DBN** values that are almost identical to those in **sat_results**. Those data sets also have information we need to keep. Most of our analysis would be impossible if a significant number of rows was missing from **demographics**, for example. Therefore, we'll do an inner join to avoid missing data in these columns.

## Performing the Left Joins

Both the **ap_2010** and the **graduation** data sets have many missing **DBN** values, so we'll use a left join when we merge the **sat_results** data set with them. Because we're using a **left** join, our final dataframe will have all of the same **DBN** values as the original **sat_results** dataframe.

We'll need to use the pandas **df.merge()** method to merge dataframes. The "left" dataframe is the one we call the method on, and the "right" dataframe is the one we pass into **df.merge()**.

Because we're using the **DBN** column to join the dataframes, we'll need to specify the keyword argument **on="DBN"** when calling **pandas.DataFrame.merge().**

First, we'll assign **data["sat_results"]** to the variable **combined**. Then, we'll merge all of the other dataframes with **combined**. When we're finished, **combined** will have all of the columns from all of the data sets.
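
The pattern of repeatedly merging into **combined** can be sketched as follows. The **data** dictionary here is a toy stand-in with made-up values, not the project's real data.

```python
import pandas as pd

# Toy stand-in for the data dictionary; all values are made up.
data = {
    "sat_results": pd.DataFrame({"DBN": ["01M292", "01M448", "01M450"],
                                 "sat_score": [1100.0, 1200.0, 1300.0]}),
    "ap_2010": pd.DataFrame({"DBN": ["01M292", "01M448"],
                             "Total Exams Taken": [49.0, 21.0]}),
    "graduation": pd.DataFrame({"DBN": ["01M292", "01M509"],
                                "Total Cohort": [78, 64]}),
}

combined = data["sat_results"]
for name in ["ap_2010", "graduation"]:
    # combined is the "left" dataframe; data[name] is the "right" one.
    combined = combined.merge(data[name], on="DBN", how="left")

print(combined.shape)
print(combined)
```

Every **DBN** from the toy **sat_results** survives the left joins, and rows without a match get **NaN** in the newly added columns.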

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Use the pandas [pandas.DataFrame.merge()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) method to merge the **ap_2010** data set into **combined.**
    - Make sure to specify **how="left"** as a keyword argument to indicate the correct join type.
    - Make sure to assign the result of the merge operation back to **combined.**
- Use the pandas **df.merge()** method to merge the **graduation** data set into **combined.**
    - Make sure to specify **how="left"** as a keyword argument to get the correct join type.
    - Make sure to assign the result of the merge operation back to **combined.**
- Display the first few rows of **combined** to verify that the correct operations occurred.
- Use the [pandas.DataFrame.shape](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) attribute to display the shape of the dataframe and see how many rows now exist.

In [72]:
# put your code here

## Performing the Inner Joins

Now that we've performed the left joins, we still have to merge **class_size**, **demographics**, **survey**, and **hs_directory** into **combined**. Because these files contain information that's more valuable to our analysis and also have fewer missing **DBN** values, we'll use the **inner** join type.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Merge **class_size** into **combined**. Then, merge **demographics**, **survey**, and **hs_directory** into **combined** one by one, in that order.
    - Be sure to follow the exact order above.
    - Remember to specify the correct column to join on, as well as the correct join type.
- Display the first few rows of **combined** to verify that the correct operations occurred.
- Use the **pandas.DataFrame.shape** attribute to display the shape of the dataframe and see how many rows now exist.

In [73]:
# put your code here

## Filling in Missing Values

You may have noticed that the inner joins left **combined** with 116 fewer rows than **sat_results**. This is because pandas couldn't find the **DBN** values that existed in **sat_results** in the other data sets. While this is worth investigating, we're currently looking for high-level correlations, so we don't need to dive into which **DBNs** are missing.

You may also have noticed that we now have many columns with null (**NaN**) values. This is because we chose to do **left** joins, where some columns may not have had data. The data set also had some missing values to begin with. If we hadn't performed a **left** join, all of the rows with missing data would have been lost in the merge process, which wouldn't have left us with many high schools in our data set.

There are several ways to handle missing data, and we'll cover them in more detail later on. For now, we'll just fill in the missing values with the overall mean for the column, like so:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1OmhXzMuPrGSmyyugGXpmrRlLznDHxOeT"></left>

In the diagram above, the mean of the first column is (1800 + 1600 + 2200 + 2300) / 4, or 1975, and the mean of the second column is (20 + 30 + 30 + 50) / 4, or 32.5. We replace the missing values with the means of their respective columns, which allows us to proceed with analyses that can't handle missing values (like correlations).

We can fill in missing data in pandas using the [pandas.DataFrame.fillna()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method. This method will replace any missing values in a dataframe with the values we specify. We can compute the mean of every column using the [pandas.DataFrame.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) method. If we pass the results of the **df.mean()** method into the **df.fillna()** method, pandas will fill in the missing values in each column with the mean of that column.

Here's an example of how we would accomplish this:

```python
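# numeric_only=True skips text columns, which have no mean.
means = df.mean(numeric_only=True)
# Each missing value is replaced with the mean of its own column.
df = df.fillna(means)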
```

Note that if a column consists entirely of null or **NaN** values, pandas won't be able to fill in the missing values when we use the **df.fillna()** method along with the **df.mean()** method, because there won't be a mean.

We should fill any **NaN** or null values that remain after the initial replacement with the value 0. We can do this by passing 0 into the **df.fillna()** method.
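
Here's a sketch of the two-step fill on a toy dataframe. The values are made up, and the **all_missing** column is a hypothetical example of a column with no data at all.

```python
import numpy as np
import pandas as pd

# Toy data: one partially missing column and one entirely missing column.
combined = pd.DataFrame({
    "DBN": ["01M292", "01M448", "01M450"],
    "sat_score": [1800.0, np.nan, 2200.0],
    "all_missing": [np.nan, np.nan, np.nan],
})

# Step 1: fill with column means (text columns are skipped).
means = combined.mean(numeric_only=True)
combined = combined.fillna(means)

# Step 2: the all-NaN column has no mean, so fill what's left with 0.
combined = combined.fillna(0)
print(combined)
```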

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Calculate the means of all of the columns in **combined** using the **pandas.DataFrame.mean()** method.
- Fill in any missing values in **combined** with the means of the respective columns using the **pandas.DataFrame.fillna()** method.
- Fill in any remaining missing values in **combined** with 0 using the **df.fillna()** method.
- Display the first few rows of **combined** to verify that the correct operations occurred.

In [77]:
# put your code here

## Adding a School District Column for Mapping

We've finished cleaning and combining our data! We now have a clean data set on which we can base our analysis. Mapping the statistics out on a school district level might be an interesting way to analyze them. Adding a column to the data set that specifies the school district will help us accomplish this.

The school district is just the first two characters of the **DBN**. We can apply a function over the **DBN** column of **combined** that pulls out the first two characters.

For example, we can use indexing to extract the first few characters of a string, like this:

```python
name = "Sinbad"
print(name[0:2])
```
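
The same slicing works on every value in a column when combined with **Series.apply()**. In this sketch, the function name `first_two_chars` and the toy **DBN** values are illustrative only.

```python
import pandas as pd

def first_two_chars(dbn):
    # Slicing with [0:2] returns the characters at positions 0 and 1.
    return dbn[0:2]

# Toy stand-in for the combined dataframe.
combined = pd.DataFrame({"DBN": ["01M292", "02M047", "10X141"]})
combined["school_dist"] = combined["DBN"].apply(first_two_chars)
print(combined["school_dist"])
```

The result stays a string, so a district like "01" keeps its leading zero.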

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Write a function that extracts the first two characters of a string and returns them.
- Apply the function to the **DBN** column of **combined**, and assign the result to the **school_dist** column of **combined**.
- Display the first few items in the **school_dist** column of **combined** to verify the results.


In [79]:
# put your code here

## Next Steps

We now have a clean data set we can analyze! We've done a lot in this mission. We've gone from having several messy sources to one clean, combined data set that's ready for analysis.

Along the way, we've learned about:

- How to handle missing values
- Different types of merges
- How to condense data sets
- How to compute averages across dataframes

Data scientists rarely start out with tidy data sets, which makes cleaning and combining them one of the most critical skills any data professional can learn.

In the next mission, we'll analyze our clean data to find correlations and create maps.