## Basic Data Processing with Pandas

### DataFrame Data Structures

The `DataFrame` is the primary object in the `Pandas` library. Conceptually, it is a two-dimensional Series object: it has an index and multiple columns of content, and each column has a label.

You can think of the `DataFrame` simply as a two-axis labeled array.
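To make the two labeled axes concrete, here is a minimal sketch. The toy data below is hypothetical and only serves to show the row labels, the column labels, and the fact that each column is itself a Series.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two labeled axes
df = pd.DataFrame(
    {"Name": ["Alice", "Jack"], "Score": [85, 82]},
    index=["school1", "school2"],
)

print(df.index.tolist())    # row labels (axis 0)
print(df.columns.tolist())  # column labels (axis 1)
print(df.shape)             # (number of rows, number of columns)

# Each column is a Series that shares the DataFrame's index
name_column = df["Name"]
print(type(name_column).__name__)
```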

In [1]:
import pandas as pd

In [3]:
# School records for students
record1 = pd.Series({'Name': 'Alice',
                     'Class': 'Physics',
                     'Score': 85})

record2 = pd.Series({'Name': 'Jack',
                     'Class': 'Chemistry',
                     'Score': 82})

record3 = pd.Series({'Name': 'Helen',
                     'Class': 'Biology',
                     'Score': 90})

In [4]:
# Create a DataFrame using the individual Series as "rows"
df = pd.DataFrame([record1, record2, record3], index=['school1', 'school2', 'school3'])

df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school3,Helen,Biology,90


In [8]:
# Create a DataFrame using dictionaries
students = [{'Name': 'Alice',
             'Class': 'Physics',
             'Score': 85},
             {'Name': 'Jack',
              'Class': 'Chemistry',
              'Score': 82},
             {'Name': 'Helen',
              'Class': 'Biology',
              'Score': 90}]

df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


In [9]:
# Using `.loc[]`
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

In [10]:
# Selecting a unique index label with `.loc` returns a Series
type(df.loc['school2'])

pandas.core.series.Series

In [11]:
# A non-unique index label matches several rows, so `.loc` returns a DataFrame instead of a Series
df.loc['school1']

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school1,Helen,Biology,90


In [12]:
type(df.loc['school1'])

pandas.core.frame.DataFrame
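Because the return type depends on whether the label is unique, it can help to check the index first. The following is a small sketch on hypothetical data; passing a list of labels is one way to get a DataFrame back regardless of how many rows match.

```python
import pandas as pd

# Hypothetical data with a repeated index label
df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school1"],
)

print(df.index.is_unique)  # False, because 'school1' appears twice

single = df.loc["school2"]          # unique label: a Series
repeated = df.loc["school1"]        # repeated label: a DataFrame
always_frame = df.loc[["school2"]]  # list of labels: a DataFrame even for one match

print(type(single).__name__, type(repeated).__name__, type(always_frame).__name__)
```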

In [13]:
# Select data based on multiple axes
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [14]:
# Pandas reserves the indexing operator `[]` for column selection (column projection); this works because columns always have a name
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [15]:
# However, you cannot pass a column name alone to `.loc`: a single argument is looked up among the row labels
# df.loc['Name'] # Raises a KeyError because 'Name' is not a row label
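As a sketch of what happens here (on hypothetical data), the lookup below fails with a `KeyError`, while supplying an explicit row selector makes `.loc` work for column projection too.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame(
    {"Name": ["Alice", "Jack"], "Score": [85, 82]},
    index=["school1", "school2"],
)

try:
    df.loc["Name"]  # 'Name' is searched for among the row labels
except KeyError as err:
    print("KeyError:", err)
    raised = True
else:
    raised = False

# With a row selector (`:` means all rows), .loc can project a column
names = df.loc[:, "Name"]
print(names.tolist())
```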

In [16]:
# Chain two indexing operations: select the rows by index label first, then the column
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

**Chaining**

In [17]:
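# Chaining two indexing operations, as above, tends to cause Pandas to return a copy of the DataFrame instead of a view
# Prefer a single `.loc` call with a row selector and a column selector
df.loc[:, ['Name', 'Score']] # `:` selects all the rows; the list selects the columns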

Unnamed: 0,Name,Score
school1,Alice,85
school2,Jack,82
school1,Helen,90
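The single-call form matters most when assigning values. The sketch below, on hypothetical data, reads and writes through one `.loc` call each, so the operation acts on `df` itself rather than on an intermediate result.

```python
import pandas as pd

# Hypothetical data with a repeated index label
df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school1"],
)

# Read: one .loc call with a row selector and a column selector
school1_names = df.loc["school1", "Name"]
print(school1_names.tolist())

# Write: the same single call on the left-hand side updates df directly
df.loc["school2", "Score"] = 88
print(df.loc["school2", "Score"])
```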


**Dropping data**

In [18]:
# Drop function to delete data in Series and DataFrames: returns a copy of the DataFrame!
df.drop('school1') # A normal drop

Unnamed: 0,Name,Class,Score
school2,Jack,Chemistry,82


In [19]:
df # df still intact

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


In [20]:
# Optional parameters for drop: `inplace` (update the DataFrame in place instead of returning a copy) and `axis` (which axis to drop from; default=0, i.e., rows)
copy_df = df.copy()

copy_df.drop('Name', inplace=True, axis=1) # We drop the column 'Name' (axis 1) and it's updated inplace

copy_df

Unnamed: 0,Class,Score
school1,Physics,85
school2,Chemistry,82
school1,Biology,90


In [21]:
# Another way is to use `del`
del copy_df['Class']
copy_df

Unnamed: 0,Score
school1,85
school2,82
school1,90
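`drop` also accepts the `index=` and `columns=` keywords, which some readers find clearer than `axis`. This is a brief sketch on hypothetical data; `errors="ignore"` skips labels that do not exist instead of raising.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame(
    {"Name": ["Alice", "Jack"], "Class": ["Physics", "Chemistry"], "Score": [85, 82]},
    index=["school1", "school2"],
)

without_name = df.drop(columns=["Name"])      # equivalent to df.drop("Name", axis=1)
without_school1 = df.drop(index=["school1"])  # equivalent to df.drop("school1")
unchanged = df.drop(columns=["Missing"], errors="ignore")  # no such column, no error

print(without_name.columns.tolist())
print(without_school1.index.tolist())
```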


In [22]:
# Adding a new column
df['ClassRanking'] = None
df

Unnamed: 0,Name,Class,Score,ClassRanking
school1,Alice,Physics,85,
school2,Jack,Chemistry,82,
school1,Helen,Biology,90,
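Assigning a scalar, as above, broadcasts the value to every row. As a sketch on hypothetical data (the column names `Passed`, `Rank`, and `Bonus` are made up for illustration), a list is assigned by position, while a Series is aligned on the index labels.

```python
import pandas as pd

# Hypothetical data with a unique index
df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school3"],
)

df["Passed"] = True      # scalar: broadcast to every row
df["Rank"] = [2, 3, 1]   # list: assigned by position, length must match

# Series: aligned on index labels; rows without a matching label get NaN
bonus = pd.Series({"school3": 5, "school1": 2})
df["Bonus"] = bonus

print(df)
```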


### DataFrame Indexing and Loading

In Jupyter, `IPython` lets you integrate lower-level shell commands into the data science workflow. An exclamation mark (`!`) at the beginning of a line executes the remainder of that line as a shell command. The shell command used here is `cat` (short for "concatenate"), which prints the contents of a file.
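Shell commands such as `cat` depend on the operating system's shell. As a portable alternative, the sketch below peeks at the first lines of a file in plain Python. The helper `peek` and the temporary `sample.csv` file are hypothetical and exist only to keep the example self-contained.

```python
import tempfile
from itertools import islice
from pathlib import Path


def peek(path, n=5):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Hypothetical stand-in file so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("Serial No.,GRE Score\n1,337\n2,324\n", encoding="utf-8")
    first_lines = peek(sample, n=2)

print(first_lines)
```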

In [28]:
# !cat ../resources/week-2/datasets/Admission_Predict.csv # Shows the csv file 



In [29]:
df = pd.read_csv('../resources/week-2/datasets/Admission_Predict.csv')
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [30]:
# Set the Serial No as the index for the students
df = pd.read_csv('../resources/week-2/datasets/Admission_Predict.csv', index_col=0)
df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [31]:
# Give the columns "SOP" and "LOR" clearer names with `rename`
# Columns that are not listed in the mapping keep their current names
new_df = df.rename(columns={'SOP': 'Statement of Purpose', 'LOR': 'Letter of Recommendation'})

new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [32]:
# LOR did not get renamed, so we need to check the column names
new_df.columns # 'LOR ' and 'Chance of Admit ' have a space

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')
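Trailing spaces are hard to spot in the printed table. As a minimal sketch, the snippet below lists every column name that changes when stripped. The in-memory CSV is a hypothetical stand-in for the admissions file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV whose header contains trailing spaces
raw = io.StringIO("Serial No.,SOP,LOR ,Chance of Admit \n1,4.5,4.5,0.92\n")
df = pd.read_csv(raw, index_col=0)

# Column names that differ from their stripped form carry extra whitespace
padded = [name for name in df.columns if name != name.strip()]
print([repr(name) for name in padded])
```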

In [34]:
# Rename again
new_df = new_df.rename(columns={'LOR ': 'Letter of Recommendation'})
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [36]:
# Alternatively, pass a function as `mapper`: here `str.strip` removes leading and trailing whitespace from every column name
new_df = new_df.rename(mapper=str.strip, axis='columns')
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [37]:
# Assigning a list of names to `df.columns` renames all the columns at once
cols = list(df.columns)
cols = [x.lower().strip() for x in cols] # lowercase each name and strip surrounding whitespace

# Replace it in the df
df.columns = cols
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
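The same cleanup can be written with the vectorized `.str` accessor on the column index instead of a list comprehension. This is a sketch on hypothetical data; replacing spaces with underscores is an optional extra step, not something the notebook above does.

```python
import pandas as pd

# Hypothetical frame with untidy column names
df = pd.DataFrame({"GRE Score": [337], "LOR ": [4.5], "Chance of Admit ": [0.92]})

# Strip whitespace, lowercase, and (optionally) replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(df.columns.tolist())
```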


### Querying a DataFrame