### DS102 | In Class Practice Week 1B - Pandas & Numpy I
<hr>
## Learning Objectives
At the end of the lesson, you will be able to:

- read a CSV file into a `DataFrame` using `pd.read_csv()` with the default `sep` (a comma) as the separator

- find out key properties (`shape`, `columns`, `dtypes`) of a `DataFrame`

- use `head()` and `sample()` to look at a subset of the dataset


- retrieve a row from a `DataFrame` using indices and `.iloc`

- retrieve a column from a `DataFrame`, recognising this gives a `Series` 


- filter data using numeric comparators e.g. `>=1000`

- filter data using string equality comparators e.g. `== 'RED'`

- filter data against multiple values using `.isin()` e.g. `.isin([1, 2, 3])`

- filter data using two conditions, using the AND clause


### Datasets Required for this In Class
1. `employees-1k.csv`

#### Import `pandas` and `numpy`

In [1]:
# import pandas & numpy, with aliases
import pandas as pd
import numpy as np

### Instantiate a `DataFrame`

**Instantiate a `DataFrame` from an external CSV file**
To manipulate data as a `DataFrame`, use `pd.read_csv()` to read the CSV file into a `DataFrame`. Refer to the documentation [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv).

In [6]:
# Read from CSV file
df = pd.read_csv('employees-1k.csv')
df.head()

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
0,44037036,55000.0,Account Specialist,OWN
1,41036228,184000.0,IT Architect,MORTGAGE
2,67601397,30000.0,night auditor,RENT
3,64972576,72000.0,Field Service Representative,MORTGAGE
4,59922079,84000.0,Senior Pastor,MORTGAGE
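
The `sep` parameter of `pd.read_csv()` defaults to a comma, which is why the call above needs no extra arguments. As a minimal sketch of when you would set it yourself, the snippet below reads semicolon-separated text. The in-memory text is a made-up stand-in built from the first rows shown above; it is not a file used in this class.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data (an in-memory stand-in for a file)
raw = io.StringIO(
    "employee_id;annual_inc;employee_title;home_ownership\n"
    "44037036;55000.0;Account Specialist;OWN\n"
    "41036228;184000.0;IT Architect;MORTGAGE\n"
)

# sep tells read_csv which character separates the values on each line
df_semicolon = pd.read_csv(raw, sep=';')
print(df_semicolon.shape)
```

Without `sep=';'`, each line would be read as a single column, because no commas separate the values.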


Find out the no. of rows and columns using `df.shape`

In [7]:
# Find no. of rows and columns
v = df.shape
print(v)

(1000, 4)
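
`shape` is a plain Python tuple of the form `(rows, columns)`, so it can be unpacked into two variables. A minimal sketch, using a small hand-built `DataFrame` (`df_small`, a name introduced here for illustration) instead of the class dataset:

```python
import pandas as pd

# Small stand-in DataFrame built from the first rows shown above
df_small = pd.DataFrame({
    'employee_id': [44037036, 41036228, 67601397],
    'annual_inc': [55000.0, 184000.0, 30000.0],
})

# Unpack the (rows, columns) tuple into two variables
n_rows, n_cols = df_small.shape
print('Rows:', n_rows)
print('Columns:', n_cols)
```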


Check the columns in the `DataFrame` using `df.dtypes` and `df.columns`

In [8]:
# Check the columns using dtypes
df.dtypes

employee_id         int64
annual_inc        float64
employee_title     object
home_ownership     object
dtype: object

In [9]:
# Check the columns using columns
df.columns

Index(['employee_id', 'annual_inc', 'employee_title', 'home_ownership'], dtype='object')
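
As a hedged sketch of how these two properties can be used together, the snippet below turns `columns` into a regular list and checks whether a column is numeric. `df_small` is a small stand-in built from the rows shown earlier, and `pd.api.types.is_numeric_dtype` is one of several ways to do this check.

```python
import pandas as pd

# Small stand-in DataFrame built from the first rows shown above
df_small = pd.DataFrame({
    'employee_id': [44037036, 41036228, 67601397],
    'annual_inc': [55000.0, 184000.0, 30000.0],
    'employee_title': ['Account Specialist', 'IT Architect', 'night auditor'],
})

# Convert the column Index into a plain Python list
column_names = list(df_small.columns)
print(column_names)

# Check whether each column holds numbers
for name in column_names:
    is_numeric = pd.api.types.is_numeric_dtype(df_small[name])
    print(name, 'is numeric:', is_numeric)
```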

Find out what's in the dataset using `head()` or `sample()`. `head()` will show the first 5 records by default, while `sample()` will randomly select and show 1 record by default.

In [10]:
# Find the first few records with .head()
df.head()

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
0,44037036,55000.0,Account Specialist,OWN
1,41036228,184000.0,IT Architect,MORTGAGE
2,67601397,30000.0,night auditor,RENT
3,64972576,72000.0,Field Service Representative,MORTGAGE
4,59922079,84000.0,Senior Pastor,MORTGAGE


In [18]:
# Randomly sample (without replacement) with .sample()
df.sample() 

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
937,13867035,60000.0,RN,RENT
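
Both methods accept a number if you want something other than the default. The sketch below uses a small stand-in `DataFrame` (`df_small`); the `random_state` value is an arbitrary choice that makes the sample repeatable.

```python
import pandas as pd

# Small stand-in DataFrame built from the first rows shown above
df_small = pd.DataFrame({
    'employee_id': [44037036, 41036228, 67601397, 64972576, 59922079],
    'home_ownership': ['OWN', 'MORTGAGE', 'RENT', 'MORTGAGE', 'MORTGAGE'],
})

# Show the first 3 records instead of the default 5
first_three = df_small.head(3)
print(first_three)

# Randomly pick 2 records; a fixed random_state gives the same rows each run
two_random = df_small.sample(n=2, random_state=1)
print(two_random)
```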


### Retrieve records & columns from a `DataFrame`

#### Retrieve a column from a `DataFrame`
We will now retrieve all the `employee_id` data from `df`. Use the column name to retrieve the column and store it in a new variable, `emp_ids`. A single column retrieved this way is a `Series`. 
<div class="alert alert-info">
<b>DS102 Learning Guidelines: </b>For DS102, only store <u>1 column</u> in a `Series`. Refer to [the documentation](https://pandas.pydata.org/pandas-docs/stable/advanced.html) for other ways a `Series` can be represented.
</div>

In [19]:
# Retrieve a Series from a DataFrame (the datatype is verified further below)
emp_ids = df['employee_id']

In [20]:
# Use Series.head() to show the first records
emp_ids.head()

0    44037036
1    41036228
2    67601397
3    64972576
4    59922079
Name: employee_id, dtype: int64

In [28]:
# Use Series.sample() to sample records from the Series
# (Note the index label of the result. It belongs to the randomly chosen row.)
emp_ids.sample()

261    10799071
Name: employee_id, dtype: int64

In [29]:
# Verify that the datatype is a Series
print(type(emp_ids))
print(type(df))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
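
One detail worth knowing, shown here as a small sketch on a stand-in `DataFrame` (`df_small`): single square brackets with a column name give a `Series`, while double square brackets (a list of column names) give a `DataFrame`, even when the list holds only one name.

```python
import pandas as pd

# Small stand-in DataFrame built from the first rows shown above
df_small = pd.DataFrame({
    'employee_id': [44037036, 41036228, 67601397],
    'annual_inc': [55000.0, 184000.0, 30000.0],
})

# Single brackets: one column as a Series
as_series = df_small['employee_id']
print(type(as_series))

# Double brackets: a one-column DataFrame
as_frame = df_small[['employee_id']]
print(type(as_frame))
```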


#### Retrieve a record from a `DataFrame`
To retrieve just 1 record from the `DataFrame`, use `.iloc` and specify the index of the record. Put this value in the square brackets `[]`.

Note: When you do this, the datatype of the result is a `Series` whose index is the column names. Use a column name as the index to retrieve a value.

In [16]:
# Exercise: How do you show the first 5 records of the df?
df.head()

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
0,44037036,55000.0,Account Specialist,OWN
1,41036228,184000.0,IT Architect,MORTGAGE
2,67601397,30000.0,night auditor,RENT
3,64972576,72000.0,Field Service Representative,MORTGAGE
4,59922079,84000.0,Senior Pastor,MORTGAGE


In [17]:
# Retrieve the second record from the df using .iloc[] and validate the result. (index locate)
# Store this in a variable called row_2
row_2 = df.iloc[1]
row_2

employee_id           41036228
annual_inc              184000
employee_title    IT Architect
home_ownership        MORTGAGE
Name: 1, dtype: object

In [30]:
# Verify that the datatype is a Series, where the index is now the column names.
print(type(row_2))

<class 'pandas.core.series.Series'>


In [32]:
#Get the employee's annual_inc
annual_inc = row_2['annual_inc']
print(annual_inc)

184000.0


In [33]:
# You can use .iloc on a Series too.
# Retrieve the first record from the emp_ids Series using .iloc
emp_ids.iloc[0]

44037036
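
`.iloc` works by position, so it also accepts negative positions and slices, just like a Python list. A minimal sketch on a stand-in `Series` (the name `emp_ids_small` is introduced here for illustration):

```python
import pandas as pd

# Small stand-in Series built from the first employee IDs shown above
emp_ids_small = pd.Series(
    [44037036, 41036228, 67601397, 64972576, 59922079],
    name='employee_id',
)

# Position 0 is the first record, position -1 is the last record
first_id = emp_ids_small.iloc[0]
last_id = emp_ids_small.iloc[-1]
print(first_id, last_id)

# A slice returns a shorter Series (the end position is excluded)
first_two = emp_ids_small.iloc[0:2]
print(first_two)
```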

In [40]:
# Complete the following code to retrieve the employee_id and employee_title and home_ownership
emp_id = row_2['employee_id']
emp_title = row_2['employee_title']
emp_home = row_2['home_ownership']

# Then, fill in the blanks in the following statement.
print('The employee with ID ' + str(emp_id) + ' has the job title ' + emp_title + ' and has a ' + emp_home + '.')

The employee with ID 41036228 has the job title IT Architect and has a MORTGAGE.


In [41]:
# Your turn: What is the employee ID and annual income of the 11th employee (index=10)?
# Write your code here

row_11 = df.iloc[10]

emp_10id = row_11['employee_id']
emp_10income = row_11['annual_inc']

# Both values are numbers, so convert them with str() before concatenating
print('The employee with ID ' + str(emp_10id) + ' has $' + str(emp_10income) + '.')

# emp_10id = 0.0
# emp_10annual_inc = 0.0

# Then, fill in the blanks in the following statement. Debug the code by 
# specifying the correct datatypes of the variables.
# print('The employee with ID ' + emp_10id + ' has an annual income of ' + emp_10annual_inc + '.')


# Retrieve the annual_inc of the 11th employee (index=10) in two ways

# First the row, then the column
row10 = df.iloc[10]
print(row10['annual_inc'])

# First the column, then the row
annual_incs = df['annual_inc']
print(annual_incs.iloc[10])

The employee with ID 20967626 has $52000.0.
52000.0
52000.0
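
The two orders give the same value because both end up at the same cell. As a hedged sketch on a small stand-in `DataFrame` (`df_small`), the snippet below also shows a third option, passing a row position and a column position to `.iloc` in one step. This third form is an addition for illustration.

```python
import pandas as pd

# Small stand-in DataFrame built from the first rows shown above
df_small = pd.DataFrame({
    'employee_id': [44037036, 41036228, 67601397],
    'annual_inc': [55000.0, 184000.0, 30000.0],
})

# First the row, then the column
row_then_col = df_small.iloc[1]['annual_inc']

# First the column, then the row
col_then_row = df_small['annual_inc'].iloc[1]

# Both at once: row position 1, column position 1 (annual_inc)
both_at_once = df_small.iloc[1, 1]

print(row_then_col, col_then_row, both_at_once)
```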


### Filter records from a `DataFrame`

#### Filtering by one condition, numeric

To filter by numeric values, first **check that the column is numeric (`int64` or `float64`)**; the `df.dtypes` output above already shows that `annual_inc` is `float64`. Then use the following pattern:
```python
df[df['column'] <conditional operator> <value>]
```
Take note of the open & close square brackets, and how a condition is expressed, where you use a **column name**, a **conditional operator** and a **value**.
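
It can help to look at the inner expression on its own. The condition produces a `Series` of `True`/`False` values (often called a boolean mask), and the outer square brackets keep only the rows where the mask is `True`. A minimal sketch on a stand-in `DataFrame` (`df_small`), with an arbitrary threshold of 72000:

```python
import pandas as pd

# Small stand-in DataFrame built from the first rows shown above
df_small = pd.DataFrame({
    'employee_id': [44037036, 41036228, 67601397, 64972576, 59922079],
    'annual_inc': [55000.0, 184000.0, 30000.0, 72000.0, 84000.0],
})

# Step 1: the condition alone gives one True/False value per row
mask = df_small['annual_inc'] >= 72000
print(mask)

# Step 2: the outer brackets keep only the rows where the mask is True
filtered = df_small[mask]
print(filtered)
```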

In [37]:
# Q. Identify all records in the df where the annual_inc is 300000 or greater.

# Before filtering, it is good practice to use copy() first, so that
# later modifications are not reflected in the original df
df_above_300k = df.copy()

# Perform the filtering
df_above_300k = df_above_300k[
    df_above_300k['annual_inc'] >= 300000
]

df_above_300k

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
127,31127612,500000.0,CEO,MORTGAGE
627,35843557,345000.0,Director,MORTGAGE
657,4629113,340000.0,Citigroup,RENT
695,7757331,600000.0,Kildare Capital,RENT
843,65509222,325000.0,Global solutions leader,MORTGAGE
863,49282491,500000.0,Vice President,RENT
946,57415353,310000.0,Senior Vice President,RENT


#### Filtering by one condition, using string comparators

To filter on a string column, compare it with `==`, e.g. `== 'OWN'`. Follow the same pattern as above: a column, a comparator and a value, in that order.

In [38]:
# Copy the df
df_home_ownership_own = df.copy()

# Filter for all records where home_ownership has the value 'OWN'
df_home_ownership_own = df_home_ownership_own[
            df_home_ownership_own['home_ownership'] == "OWN"
]

df_home_ownership_own.head()

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
0,44037036,55000.0,Account Specialist,OWN
19,35845083,45000.0,claims,OWN
25,34691613,50000.0,General mgr,OWN
32,36665106,73000.0,VP Operations,OWN
43,10018508,75000.0,Police Officer,OWN


You can find records matching any of several values with the `isin()` method. First state the **column** you want to filter on. Then call `.isin()` on it, as you would any method. Finally, pass the values of interest **as a list** to `isin()`.

In [40]:
# Filter for all records where employee_title is 'Accountant' or 'Sales'

# Copy the df
df_of_interest = df.copy()
titles = ['Accountant', 'Sales']
df_of_interest = df_of_interest[
    df_of_interest['employee_title'].isin(titles)
]

df_of_interest

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
93,70641473,100000.0,Sales,MORTGAGE
98,65559791,85000.0,Sales,MORTGAGE
170,63773407,55000.0,Sales,MORTGAGE
247,47357024,55000.0,Sales,RENT
485,35924591,130000.0,Sales,MORTGAGE
582,19867982,62000.0,Sales,RENT
807,31726709,85000.0,Accountant,RENT
811,67688327,72000.0,Sales,MORTGAGE
839,63613406,60000.0,Accountant,RENT
913,70334794,120000.0,Sales,RENT


#### Filtering by two conditions using the AND logic

If you would like to filter on two conditions, using `AND` as the logical operator, put a `&` between them. Surround each condition with round brackets `(` and `)`, because `&` binds more tightly than comparison operators such as `==` and `>=`.

In [42]:
# Copy the df
df_of_interest = df.copy()

# Filter for all records where the employee_title is President and the annual_inc is 225000 or greater
df_of_interest = df_of_interest[
    (df_of_interest['employee_title'] == 'President') &
    (df_of_interest['annual_inc'] >= 225000)
    # & (<condition>)...
]

df_of_interest

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
238,370086,250000.0,President,MORTGAGE


In [42]:
# Contrast this with the following filter:
# Q. Find all employees with the title 'President'
# Your turn: Can you solve this in-class exercise?
df_of_interest = df.copy()

df_of_interest = df_of_interest[
    df_of_interest['employee_title'] == 'President'    
]

df_of_interest

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
7,37075903,108000.0,President,RENT
72,64555876,100000.0,President,MORTGAGE
238,370086,250000.0,President,MORTGAGE
293,23052676,100000.0,President,RENT
515,66185317,80000.0,President,OWN
775,11217187,72000.0,President,RENT


In [44]:
# Contrast this with the following filter:
# Q. Find all employees who have an annual_inc of 225000 or more.
# Your turn: Can you solve this in-class exercise?

df_of_interest = df.copy()

df_of_interest = df_of_interest[
    df_of_interest['annual_inc'] >= 225000
]

df_of_interest

Unnamed: 0,employee_id,annual_inc,employee_title,home_ownership
116,41302339,235000.0,Account Manager,MORTGAGE
127,31127612,500000.0,CEO,MORTGAGE
238,370086,250000.0,President,MORTGAGE
338,43848504,248000.0,Editor,RENT
370,68317904,250000.0,Speech Therapist,MORTGAGE
411,31706813,260000.0,Advisor,MORTGAGE
505,42454613,250000.0,Chief Technology Officer,MORTGAGE
627,35843557,345000.0,Director,MORTGAGE
657,4629113,340000.0,Citigroup,RENT
695,7757331,600000.0,Kildare Capital,RENT


**Credits**
- [Lending Club Loan Data, Kaggle](https://www.kaggle.com/wendykan/lending-club-loan-data) for the dataset
<hr>
`HWA-DS102-INCLASS-1B-201810`