# Assignment 2: Pandas & Simple Visualizations
## Part 2: Pandas on the Wellesley Courses dataset

Author: Shreya Parjan

Sep 19, 2019

Note: This submission made use of the automatic extension so that I had time to resubmit after reformatting some of my content following the helpful feedback on assignment 1.

This notebook shows the output I expect at each step as I write the code to clean up the data.

**Steps to clean up the data**

* Load file from Excel
* Drop some columns we don't need
* Rename the remaining columns
* Create a new column for the department name
* Save data as a CSV

Collaboration: I discussed the assignment with Julia & Aviv.

## Table of contents
1. [Step 0: Relevant Initial Imports](#s0)
2. [Step 1: Load the data from Excel](#s1)
3. [Step 2: Drop some columns](#s2)
4. [Step 3: Rename columns](#s3)
5. [Step 4: Create new column for the department name](#s4)
6. [Step 5: Save the dataframe in a CSV file](#s5)

### Step 0: Relevant Initial Imports
<a id="s0"></a>

In [2]:
import pandas as pd
import numpy as np

### Step 1: Load the data from Excel
<a id="s1"></a>

In [4]:
# reads in the initial untransformed data from Excel
data = pd.read_excel("courses-2019.xlsx")

### Step 2: Drop some columns
<a id="s2"></a>

The table has many more columns than we need. Let's practice dropping some of them. First, let's get all their names:

In [5]:
# Extracts the columns from data
columns = data.columns
print(columns)

Index(['CRN', 'Course', 'Title', 'CurrentEnrollment', 'SeatsAvailable',
       'Location(s)', 'Meeting Time(s)', 'Day(s)', 'Instructor',
       'Additional Instructor(s)', 'Distribution(s)', 'More'],
      dtype='object')


Now we can use the method `drop` to drop multiple columns at once. Copy the names of the columns to drop from the previous output cell. Because `drop` returns a new dataframe by default, the result is assigned to `data_dropped` and `data` itself is left unchanged. Look at the output to see which columns we are keeping.

In [6]:
# data_dropped is our new dataframe with several columns that aren't relevant to our analysis removed.
data_dropped = data.drop(columns=['Title','SeatsAvailable','Location(s)','Instructor','Additional Instructor(s)','Distribution(s)','More'])
data_dropped.head()

| | CRN | Course | CurrentEnrollment | Meeting Time(s) | Day(s) |
|---|---|---|---|---|---|
| 0 | 13587 | AFR 105 - 01 | 24 | 12:45 PM - 3:25 PM | T |
| 1 | 15568 | AFR 201 - 01 | 8 | 6:30 PM - 9:10 PM | M |
| 2 | 15753 | AFR 215 - 01 | 16 | 9:55 AM - 11:10 AM | MR |
| 3 | 15071-15207 | AFR 242 - 01 | 30 | 9:55 AM - 11:10 AM | TF |
| 4 | 15570-15571 | AFR 264 - 01 | 19 | 9:55 AM - 11:10 AM | TF |
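To check my understanding of how `drop` behaves, here is a minimal sketch on a toy dataframe. The dataframe `toy` and its values are made up for illustration and are not part of the courses dataset; the point is only that `drop` returns a new dataframe unless `inplace=True` is passed.

```python
import pandas as pd

# Toy stand-in for the courses table (made-up values, for illustration only).
toy = pd.DataFrame({
    "CRN": ["10001", "10002"],
    "Title": ["Intro A", "Intro B"],
    "More": ["x", "y"],
})

# drop returns a new dataframe by default; toy keeps all three columns.
toy_dropped = toy.drop(columns=["Title", "More"])

print(list(toy.columns))          # original is untouched
print(list(toy_dropped.columns))  # only the kept column remains
```

Assigning the result to a new name keeps the original around, which is handy if a later step turns out to need one of the dropped columns.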


### Step 3: Rename columns
<a id="s3"></a>

Create a dictionary to map the old names to the new names, then call the method `rename`. Here the result is assigned to a new dataframe, `data_renamed`, rather than renaming in place. The new dataframe is shown below.

In [7]:
# data_renamed has renamed our columns from data_dropped

data_renamed = data_dropped.rename(columns={"CRN": "crn", "Course": "course", "CurrentEnrollment": "enrollment", "Meeting Time(s)":"meeting", "Day(s)":"days"})
data_renamed.head()

| | crn | course | enrollment | meeting | days |
|---|---|---|---|---|---|
| 0 | 13587 | AFR 105 - 01 | 24 | 12:45 PM - 3:25 PM | T |
| 1 | 15568 | AFR 201 - 01 | 8 | 6:30 PM - 9:10 PM | M |
| 2 | 15753 | AFR 215 - 01 | 16 | 9:55 AM - 11:10 AM | MR |
| 3 | 15071-15207 | AFR 242 - 01 | 30 | 9:55 AM - 11:10 AM | TF |
| 4 | 15570-15571 | AFR 264 - 01 | 19 | 9:55 AM - 11:10 AM | TF |
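As a small sketch of the dictionary-based renaming, the example below uses a one-row toy dataframe (the name `toy` and the dictionary name `new_names` are my own, chosen for illustration). It suggests that keeping the mapping in its own variable makes the cell easier to read, and that `rename` also returns a new dataframe by default.

```python
import pandas as pd

# One-row toy dataframe using two of the original column names.
toy = pd.DataFrame({"CRN": ["13587"], "Day(s)": ["T"]})

# Mapping from old column names to new ones (hypothetical variable name).
new_names = {"CRN": "crn", "Day(s)": "days"}

# rename returns a new dataframe; toy keeps its original column names.
toy_renamed = toy.rename(columns=new_names)

print(list(toy.columns))
print(list(toy_renamed.columns))
```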


### Step 4: Create a new column for the department name
<a id="s4"></a>

This is an opportunity to use the method `apply` to create the department name from the course name, by splitting the string. You then can create a new column to store the result of the operation.

In [8]:
# splitter is a lambda function that extracts the department name (the first space-separated token) from the course name
# here we create a new column of the extracted department names
splitter = lambda x: x.split(" ")[0]
data_renamed['dept'] = data_renamed['course'].apply(splitter)
data_renamed.head()

| | crn | course | enrollment | meeting | days | dept |
|---|---|---|---|---|---|---|
| 0 | 13587 | AFR 105 - 01 | 24 | 12:45 PM - 3:25 PM | T | AFR |
| 1 | 15568 | AFR 201 - 01 | 8 | 6:30 PM - 9:10 PM | M | AFR |
| 2 | 15753 | AFR 215 - 01 | 16 | 9:55 AM - 11:10 AM | MR | AFR |
| 3 | 15071-15207 | AFR 242 - 01 | 30 | 9:55 AM - 11:10 AM | TF | AFR |
| 4 | 15570-15571 | AFR 264 - 01 | 19 | 9:55 AM - 11:10 AM | TF | AFR |
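For comparison, the same split can probably be written without `apply`, using the vectorized `.str` accessor. This is only a sketch of an alternative, not what the assignment asks for; the series `courses` reuses two course names from the output above, and the variable names are my own.

```python
import pandas as pd

# Two course names taken from the output above.
courses = pd.Series(["AFR 105 - 01", "AFR 201 - 01"], name="course")

# Approach used in this notebook: apply a function to every element.
dept_apply = courses.apply(lambda x: x.split(" ")[0])

# Alternative sketch: split with the .str accessor, then take the first piece.
dept_str = courses.str.split(" ").str[0]

print(dept_apply.tolist())
print(dept_str.tolist())
```

Both versions take the text before the first space, so they should agree as long as every course name starts with the department code followed by a space.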


In [9]:
data_renamed.shape # (rows, columns): the first value is the total number of courses

(757, 6)

### Step 5: Save the dataframe in a CSV file
<a id="s5"></a>

Writing to a file is done through the method `to_csv` and other similar methods (e.g., `to_json`, `to_excel`). Read the documentation for this method to find out what parameters you need to supply. Notice that most of them have default values that you can keep.

In [10]:
# Finally, we write our dataframe to a csv.
data_renamed.to_csv('courses-cleanedby-shreya.csv')
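One default worth knowing about is `index=True`, which writes the row index as an unnamed first column. The sketch below uses a one-row toy dataframe (the name `toy` is my own) and calls `to_csv` without a path, which returns the CSV text as a string instead of writing a file, so nothing is saved to disk.

```python
import io
import pandas as pd

# One-row toy dataframe built from the first row of the output above.
toy = pd.DataFrame({"crn": ["13587"], "course": ["AFR 105 - 01"], "dept": ["AFR"]})

# With no path, to_csv returns the CSV text as a string.
with_index = toy.to_csv()           # default: index=True
no_index = toy.to_csv(index=False)  # leave the row index out

# Header lines: the default starts with an empty name for the index column.
print(with_index.splitlines()[0])
print(no_index.splitlines()[0])

# Reading the default version back gives the index column a placeholder name.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

If the saved file will be reloaded with `read_csv` later, passing `index=False` when writing avoids carrying the extra column along.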