# Part 2 - Cleaning

## What did we grab?

In [1]:
import numpy as np
import pandas as pd
from collegedata_names import col_rename_dict
from collegedata_names import dirty_cols_extract_dict

# Import the scraped data.
PATH = 'data/test.csv'
na_vals = ['Not reported','Not Reported', 'Not available']
df = pd.read_csv(PATH, index_col = 'SchoolId', na_values = na_vals)

# Drop columns with no scraped data.
cols_to_drop = [col for col in df.columns if df[col].isna().all()]
df.drop(columns = cols_to_drop, inplace = True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2028 entries, 6 to 3379
Columns: 183 entries, Name to Disciplines Pursued
dtypes: float64(3), object(180)
memory usage: 2.8+ MB


We scraped up a dataframe of 2028 rows of schools with 183 columns of values - only 3 of which were already in a numeric (`float64` dtype) format. The remainder are saved as strings (`object` dtype).

---

## Convert the easy columns first

With just a few lines of code, we can try to convert more columns to numeric type:

In [2]:
# Delete commas and dollar signs from all values in the dataframe.
df.replace(r'[,\$]', '', regex = True, inplace = True)

# Convert percents to decimals, then attempt to convert cols to numeric type.
# regex = True is needed on pandas 2.0+, where str.replace is literal by default.
repl = lambda m: str(float(m.group(1)) / 100)
for col in df.select_dtypes('object'):
    df[col] = df[col].str.replace(r'([\d\.]+)%', repl, regex = True)
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # Not fully numeric; leave the column as strings.
    
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2028 entries, 6 to 3379
Columns: 183 entries, Name to Disciplines Pursued
dtypes: float64(44), object(139)
memory usage: 2.8+ MB


We've removed commas and dollar signs, converted percents to decimals, and then tried to convert the remaining string values to `float64`. That was enough to get us up to 44 numeric columns.
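
To see what the percent conversion does on its own, here is a minimal sketch on a made-up Series (the sample values and the `pct_to_decimal` helper name are assumptions for illustration, not part of the scraped data):

```python
import pandas as pd

# Made-up sample values; the real scraped columns are far messier.
raw = pd.Series(["$1,250", "45%", "3.5%", None])

# Strip commas and dollar signs first.
stripped = raw.str.replace(r"[,$]", "", regex=True)

def pct_to_decimal(match):
    """Turn the captured number in 'NN%' into its decimal string."""
    return str(float(match.group(1)) / 100)

# A callable replacement receives each regex match object.
converted = stripped.str.replace(r"([\d.]+)%", pct_to_decimal, regex=True)

# Every remaining string is now a plain number, so the conversion succeeds.
numeric = pd.to_numeric(converted)
print(numeric)
```

Missing values pass through untouched and come out as `NaN` in the numeric result.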

---

## Try converting to categorical 

There are still 139 `object` columns. Some of these should probably stay as strings - values like schools' Name and Web Site. Others, like schools' Institution Type, can only take a few specific values ('Public', 'Private', 'Private for-profit'). We can convert any columns like these - those with fewer than 60 unique values - to `category` data type:

In [3]:
# Convert remaining cols with low number of unique vals to categorical cols.
for col in df.columns:
    if df[col].nunique() < 60:
        df[col] = df[col].astype('category')
        
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2028 entries, 6 to 3379
Columns: 183 entries, Name to Disciplines Pursued
dtypes: category(71), float64(38), object(74)
memory usage: 1.9+ MB


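Creating 71 `category` columns has not only reduced the number of `object` string columns but also reduced the dataframe's memory footprint. Note that the `float64` count dropped from 44 to 38: the loop runs over every column, so numeric columns with fewer than 60 unique values were converted to `category` as well.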
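
The memory saving is easy to check on a toy column. This is only a sketch with made-up data (three labels repeated 1000 times), not a measurement of the real dataframe:

```python
import pandas as pd

# Hypothetical column: three labels repeated many times.
labels = pd.Series(["Public", "Private", "Private for-profit"] * 1000)
as_category = labels.astype("category")

# deep=True counts the string payloads, not just the pointers.
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it wins when there are few unique values.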

---

## Look for format patterns

The remaining 74 `object` columns have more than 60 unique values. Some, like the schools' Name, contain no numeric values at all, but many others, like the field Students Enrolled, pack multiple numeric values into each string:

In [4]:
df['Students Enrolled'].head()

SchoolId
6       128 (0.61) of 211 admitted students enrolled
7     1573 (0.18) of 8666 admitted students enrolled
8      250 (0.23) of 1067 admitted students enrolled
9       488 (0.94) of 519 admitted students enrolled
10     465 (0.11) of 4169 admitted students enrolled
Name: Students Enrolled, dtype: object

Extracting and labeling these three values would be straightforward if every row looked like this. However, there is a problem. If we replace each numeric value with a `'#'` placeholder and count the unique formats that remain in this column:

In [5]:
df['Students Enrolled'].str.replace(r'\d+\.?\d*', '#', regex = True).value_counts()

# (#) of # admitted students enrolled    1372
# admitted students enrolled               87
Name: Students Enrolled, dtype: int64

We see there are actually two different formats in use inside the column, which rules out simply splitting and renaming it with basic pandas string operations. This problem is not unique to this column; it shows up in many of the remaining `object` columns in the dataframe.

In this column's case, as in many others, there is a dominant 'mode' format: the format shared by more than 50% of the values. We can construct a stripped dataframe holding only these formats:

In [6]:
# Create copy of df with all numeric values and extra whitespace stripped.
cols = df.select_dtypes('object').columns
formats_df = df[cols].replace(r'\d\.?\d*','#', regex = True)
formats_df.replace(r'\s+',' ', regex = True, inplace = True)

Now we can get the mode format for each of the columns:

In [7]:
modes = formats_df.mode().loc[0]

Not every mode format includes a number (which we denoted in our previous replacement with a `'#'` placeholder). If a mode format string doesn't contain a `'#'`, it's likely a column best left as a string `object` datatype.

Similarly, if a numeric mode format is not dominant in its column (not accounting for more than 50% of all values), it's likely the column values are best kept as strings.

We pick out the numeric columns and restrict the modes and the formats dataframe accordingly:

In [8]:
modes = modes[modes.str.contains('#', na = False)]
modes = modes[(formats_df == modes).sum() / formats_df.count() > 0.5]
formats_df = formats_df[modes.index]
print("{} numeric columns with majority mode columns.".format(len(modes)))

43 numeric columns with majority mode columns.
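
The whole mask, mode, and majority-check sequence can be tried end to end on a toy frame. This is a sketch under assumptions: the strings are made up, the `toy_*` and `share` names are hypothetical, and it uses the `\d+\.?\d*` pattern from the `value_counts` cell:

```python
import pandas as pd

# Made-up strings in two hypothetical columns.
toy = pd.DataFrame({
    "Students Enrolled": [
        "128 (0.61) of 211 admitted students enrolled",
        "250 (0.23) of 1067 admitted students enrolled",
        "488 admitted students enrolled",
    ],
    "Name": ["Alpha College", "Beta University", "Gamma Institute"],
})

# Replace every number with '#' so only the surrounding format remains.
toy_formats = toy.replace(r"\d+\.?\d*", "#", regex=True)

# Most common format per column (row 0 of the mode table).
toy_modes = toy_formats.mode().loc[0]

# Keep only modes that contain a number placeholder.
toy_modes = toy_modes[toy_modes.str.contains("#", na=False)]

# Share of non-null values in each kept column that match its mode format.
kept = toy_formats[toy_modes.index]
share = kept.eq(toy_modes, axis="columns").sum() / kept.count()

# Keep only modes covering more than half of their column.
toy_modes = toy_modes[share > 0.5]
print(toy_modes)
```

Here Name drops out because its mode has no `'#'`, while Students Enrolled survives because two of its three values share the mode format.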


We can now pick out the actual dataframe values that are not in the mode format - removing them from the column and setting them aside in a 'dirty' dataframe while keeping only the mode format values in the original dataframe:

In [9]:
dirty_df = df[modes.index].where(formats_df != modes)
df[modes.index] = df[modes.index].where(formats_df == modes)

With only mode-format values left in these columns, we can extract their numeric values into a new temporary 'clean' dataframe:

In [10]:
clean_df = pd.DataFrame(index = df.index)
for col in modes.index:
    vals = df[col].str.extractall(r'(\d+\.?\d*)').unstack()
    vals.columns = [col + ' - ' + str(i) for i in range(1, vals.shape[1] + 1)]
    for val_col in vals.columns:
        clean_df[val_col] = pd.to_numeric(vals[val_col])
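
To make the `extractall` / `unstack` step less opaque, here is a small sketch on a made-up Series (the sample rows and the `matches` / `wide` names are assumptions):

```python
import pandas as pd

# Made-up rows that all share one format, plus a missing value.
s = pd.Series(
    ["128 (0.61) of 211 admitted students enrolled",
     "465 (0.11) of 4169 admitted students enrolled",
     None],
    index=[6, 10, 11],
    name="Students Enrolled",
)

# extractall returns one row per match, indexed by (original index, match number).
matches = s.str.extractall(r"(\d+\.?\d*)")

# unstack pivots the match number into columns: one column per extracted value.
wide = matches[0].unstack()
wide.columns = [f"{s.name} - {i}" for i in range(1, wide.shape[1] + 1)]

# Convert to numbers and restore rows that had no matches at all.
wide = wide.apply(pd.to_numeric).reindex(s.index)
print(wide)
```

Rows without any match vanish from the `extractall` result, so they only reappear (as `NaN`) once the values are aligned back to the original index.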

The clean dataframe is now filled with new columns of values split from the original columns, each carrying a temporary label. We can rename them using a dictionary imported from an external module:

In [11]:
clean_df.rename(columns = col_rename_dict, inplace = True)

# Delete columns marked '*delete*' by col_rename_dict.
clean_df.drop(columns = '*delete*', inplace = True)

We can now drop the original columns from the primary dataframe and then join the newly extracted and labeled columns:

In [12]:
df.drop(columns = modes.index, inplace = True)
df = df.join(clean_df)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2028 entries, 6 to 3379
Columns: 239 entries, Name to Average Starting Salary
dtypes: category(71), float64(137), object(31)
memory usage: 2.9+ MB


---

## Slightly pesky manual separation

Upon inspection of the remaining string columns, a few can still be quickly separated with the help of regular expressions. For instance, the location information was scraped inside a single label:

In [13]:
col = 'City, State, Zip'

vals = df[col].str.split(r'[\xa0]+', expand = True)
vals.columns = ['City', 'State', 'Zip']
vals['State'] = vals['State'].astype('category')

df.drop(columns = col, inplace = True)
df = df.join(vals)
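
As a quick illustration of the non-breaking-space split, here is a sketch on two made-up location strings (the exact layout of the scraped values is an assumption):

```python
import pandas as pd

# Hypothetical values: fields separated by one or more non-breaking spaces.
loc = pd.Series(["Huntsville\xa0AL\xa035762", "Boston\xa0\xa0MA\xa002115"], index=[6, 7])

# A multi-character pattern is treated as a regular expression by str.split.
parts = loc.str.split(r"[\xa0]+", expand=True)
parts.columns = ["City", "State", "Zip"]
print(parts)
```

The zip code stays a string here, so a leading zero survives the split.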

The class and lab/discussion size information came in different formats, but always with the same bin labels:

In [14]:
bins = ['2-9', '10-19', '20-29', '30-39', '40-49', '50-99', 'Over 100']

regexs = [b + r' students: ([\d\.]+)' for b in bins]
class_labels = ['Class Size pct ' + b + ' students' for b in bins]
lab_labels = ['Lab/Discussion Size pct ' + b + ' students' for b in bins]

class_dict = dict(zip(class_labels, regexs))
lab_dict = dict(zip(lab_labels, regexs))

cols = ['Regular Class Size', 'Discussion Section/Lab Class Size']
extract_dict = {cols[0]: class_dict, cols[1]: lab_dict}

vals = pd.DataFrame(index = df.index)
for col, dictionary in extract_dict.items():
    for label, regex in dictionary.items():
        vals[label] = df[col].str.extract(regex)
        vals[label] = pd.to_numeric(vals[label])

df.drop(columns = cols, inplace = True)
df = df.join(vals)
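
Anchoring each regex on its bin label is what makes this work regardless of ordering. A minimal sketch, assuming made-up cell contents and only two of the bins:

```python
import pandas as pd

toy_bins = ["2-9", "10-19"]

# Hypothetical cell contents; the scraped layout may differ.
sizes = pd.Series([
    "2-9 students: 0.25 10-19 students: 0.5",
    "10-19 students: 0.4 2-9 students: 0.1",
])

out = pd.DataFrame(index=sizes.index)
for b in toy_bins:
    # The bin label anchors the match, so the order inside the string is irrelevant.
    captured = sizes.str.extract(rf"{b} students: ([\d.]+)", expand=False)
    out[b] = pd.to_numeric(captured)
print(out)
```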

And the nearest airport/train/bus fields contained both a numeric distance and the name of the closest city:

In [15]:
cols = ['Nearest Airport', 'Nearest Train Station', 'Nearest Bus Station']
vals = pd.DataFrame(index = df.index)
for col in cols:
    extracted = df[col].str.extract(r'(\d+).* in (\D*)')
    vals[col + ' (miles)'] = pd.to_numeric(extracted[0])
    vals[col + ' (city)'] = extracted[1]

df.drop(columns = cols, inplace = True)
df = df.join(vals)
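
Here is how that two-group pattern behaves on a couple of made-up strings (the wording of the real scraped values is an assumption):

```python
import pandas as pd

# Hypothetical field values, one matching and one not.
airport = pd.Series(["15 miles from campus in Huntsville", "Not applicable"])

# Group 1: leading digits. Group 2: the non-digit text after ' in '.
extracted = airport.str.extract(r"(\d+).* in (\D*)")
miles = pd.to_numeric(extracted[0])
city = extracted[1]
print(miles, city, sep="\n")
```

Strings that don't fit the pattern simply yield missing values in both new columns.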

The application fee contains a string value that should be marked numerically as zero:

In [16]:
col = 'Application Fee'
df[col] = df[col].str.replace('No fee required', '0')
df[col] = df[col].str.extract(r'(\d+)')
df[col] = pd.to_numeric(df[col])

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2028 entries, 6 to 3379
Columns: 256 entries, Name to Nearest Bus Station (city)
dtypes: category(71), float64(155), object(30)
memory usage: 3.1+ MB


---
## Clean enough?
On this dataframe, identifying the most common (mode) format in each column let us quickly isolate and convert many of the numeric columns, though we still needed a few hand-written regular expressions at the end.

We still have unprocessed strings in a dataframe of 'dirty' columns - strings that contain values locked up in formats different from the majority format of their columns. If we quit here without dealing with them, how many values will we lose?

In [17]:
dirty_df.count().sum()

4024

Bear in mind this is a count of strings - each string could contain multiple numeric values in need of extraction, along with further investigation to determine into which existing (or new) column the extracted values should be inserted. But how does this compare to the number of values already extracted?

In [18]:
df.count().sum()

342377

These 'dirty' strings amount to just over 1% of the total values extracted so far (4024 / 342377 ≈ 1.2%).

For the sake of this project, I continued to extract all values. I built a large dict-of-dicts filled with regular expressions and column labels to put the values where they should be:

In [19]:
# Extract num vals from dirty cols using the imported dirty_cols_extract_dict.
vals = pd.DataFrame(index = df.index)
for dirty_col, extract_dict in dirty_cols_extract_dict.items():
    for col, regex in extract_dict.items():
        vals[col] = dirty_df[dirty_col].str.extract(regex, expand = False)
        vals[col] = pd.to_numeric(vals[col])

# Update the dataframe with the extracted dirty values.
new_cols = [col for col in vals.columns if col not in df.columns]
df = df.join(vals[new_cols])
df.update(vals)

df.sort_index(axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2028 entries, 6 to 3379
Columns: 258 entries, 2016 Graduates Who Took Out Loans to Zip
dtypes: category(71), float64(157), object(30)
memory usage: 3.2+ MB
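
The join-then-update pattern is worth a closer look: `join` adds columns that don't exist yet, while `update` fills matching cells in existing columns with any non-missing values. A sketch on made-up numbers (the `toy_*` names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy 'clean' dataframe with a gap where a dirty string was set aside.
toy_df = pd.DataFrame({"Applications (all)": [100.0, np.nan, 300.0]}, index=[6, 7, 8])

# Toy values recovered from the dirty strings.
toy_vals = pd.DataFrame(
    {"Applications (all)": [np.nan, 250.0, np.nan],
     "Applications (EA)": [np.nan, 40.0, np.nan]},
    index=[6, 7, 8],
)

# Join only the columns the dataframe doesn't have yet...
new_cols = [c for c in toy_vals.columns if c not in toy_df.columns]
toy_df = toy_df.join(toy_vals[new_cols])

# ...then overwrite in place wherever toy_vals holds a non-missing value.
toy_df.update(toy_vals)
print(toy_df)
```

Because `update` skips missing values, the numbers already extracted from the mode-format rows are left alone.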


---

## Cleanly saving

That's it. We'll save this to a .csv file and be done.
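
The save itself isn't shown in this notebook, so the following is only a sketch on a toy frame with a hypothetical output path. It also shows one thing to keep in mind: a .csv holds plain text, so `category` columns need to be converted again after reloading.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataframe.
toy = pd.DataFrame(
    {"Name": ["Alpha College", "Beta University"],
     "Institution Type": pd.Categorical(["Public", "Private"])},
    index=pd.Index([6, 7], name="SchoolId"),
)

# Hypothetical output path; the notebook does not show the real one.
out_path = os.path.join(tempfile.mkdtemp(), "cleaned.csv")
toy.to_csv(out_path)

# Reload with the same index column; the category dtype is not preserved.
reloaded = pd.read_csv(out_path, index_col="SchoolId")
print(reloaded.dtypes)
```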

In [20]:
df.columns.tolist()

['2016 Graduates Who Took Out Loans',
 '24-Hour Emergency Phone/ Alarm Devices',
 '24-Hour Security Patrols',
 'ACT 25th',
 'ACT 75th',
 'ACT Mean',
 'Academic Calendar System',
 'Academic Interest/Achievement - Number of Awards',
 'Academic Interest/Achievement Award Areas',
 'Accept Offer of Admission',
 'Activities and Organizations',
 'Address',
 'Advanced Placement (AP) Examinations',
 'All Graduate Students',
 'All Undergraduates',
 'Application Deadline',
 'Application Fee',
 'Application Fee Waiver',
 'Applications (EA)',
 'Applications (ED)',
 'Applications (all)',
 'Applications (men)',
 'Applications (women)',
 'Athletic Conferences',
 'Average Age',
 'Average Award (All Undergraduates)',
 'Average Award (Freshmen)',
 'Average Earnings from On-Campus Employment',
 'Average Freshman Award',
 'Average GPA',
 'Average Indebtedness of 2016 Graduates',
 'Average Percent of Need Met',
 'Average Percent of Need Met (All Undergraduates)',
 'Average Percent of Need Met (Freshmen)',
 