# Week 9 
## In-Class Activity Workbook and Homework

## Learning Objectives 
### In this notebook you will learn about and practice:
1. Section 1: <a id='Section 1'></a>[Section 1: Reading and examining files with pandas](#Section-1)
2. Section 2: <a id='Section 2'></a>[Section 2: Selecting Data with pandas](#Section-2)
3. Section 3: <a id='Section 3'></a>[Section 3: More practice selecting data](#Section-3)
4. Section 4: <a id='Section 4'></a>[Section 4: Conditional Selection](#Section-4)

### Additional Sources
>- Check out the `pandas` cheat sheets provided by DataCamp and posted on Canvas
>>- https://www.datacamp.com/community/blog/python-pandas-cheat-sheet

# Section 1

## What is the `pandas` module?
>- `pandas` is a flexible data analysis library built on top of NumPy, with performance-critical parts written in C and Cython, and is one of the fastest ways of getting from zero to answer
>- `pandas` is the go-to tool for most business analysts and scientists working in Python; becoming proficient in `pandas` will do wonders for your productivity and look great on your resume
>- Some say `pandas` is basically Excel on steroids
    >- `pandas` can be thought of as a mix of Python and SQL, so if you know SQL, working with `pandas` may come more easily to you; knowing SQL is not a prerequisite, though
    
### Some of the useful ways in which you can use the `pandas` module include:

1. Loading tabular data into Python so you can work with it
2. Cleaning and filtering data, whether it's missing or incomplete
3. Engineering new feature columns that can be used in your analysis
4. Calculating statistics that answer questions (mean, median, max, min, etc)
5. Finding correlations between columns
6. Visualizing data with matplotlib


## Reading and Writing Files with the Python `pandas` Module

### Read csv or Excel files
>- csv files: `pd.read_csv('fileName.csv')`
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Multiple sheets from the same Excel file: 
>>- `xlsx = pd.ExcelFile('file.xls')` # reads in the entire workbook
>>- `df1 = pd.read_excel(xlsx, 'Sheet1')`  # reads in sheet you specify
>>- `df2 = pd.read_excel(xlsx, 'Sheet2')`

### Write csv or Excel files
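>- csv files: `YourDataFrame.to_csv('fileName.csv')` # `to_csv` is a DataFrame method, not a `pd` function
>- Excel files: `YourDataFrame.to_excel('fileName.xlsx')` # `to_excel` is a DataFrame method as well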
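
As a minimal sketch of the write-then-read round trip, the cell below writes a small DataFrame to a CSV file and reads it back. The toy data, the temporary folder, and the file name `toy_students.csv` are assumptions made for illustration; they are not the course files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, not the course's students.csv
df = pd.DataFrame({"studentID": [1, 2, 3], "Points": [18.5, 79.5, 10.25]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_students.csv")  # hypothetical file name
    # to_csv is called on the DataFrame; index=False keeps the row index out of the file
    df.to_csv(csv_path, index=False)
    # read_csv is a module-level function that returns a new DataFrame
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```

Writing with `index=False` is a choice, not a requirement; leaving it out stores the row index as an extra first column in the file.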

## Reading Files and Initial Data Examination with `pandas`

## Creating a `stu` DataFrame
>- Complete the following steps to practice reading in a csv file
>>- Note: You should download the `students.csv` and `students100.xlsx` files from Canvas and save them in the same directory/folder where you have this notebook saved
1. Import the pandas module and alias it `pd`

### Step 1: Check your working directory and make sure you have the `students.csv` and `students100.xlsx` files there
>- Note: There are several ways to do this

In [9]:
import os
os.listdir()

['.ipynb_checkpoints',
 'anscombe_quartet.csv',
 'loc vs iloc.ipynb',
 'pandas1%262_Read_Index_Select_Activity_student.ipynb',
 'pandasCheatSheet_DataCamp.pdf',
 'Pandas_Part1_Student.ipynb',
 'Pandas_Part2_Indexing-Selecting-Assigning_STUDENT.ipynb',
 'students.csv',
 'students100.xlsx',
 'winemag-data-130k-v2.csv']

### Step 2: import the `pandas` module and alias it `pd`

In [10]:
import pandas as pd

### Step 3: Read the `students.csv` file into a pandas DataFrame named `stu`
>- Look at the first five records to make sure `stu` is imported correctly

#### Loading a CSV file
function: `pd.read_csv()`

[Docu read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html?highlight=read_excel#pandas.read_excel)

In [12]:
import pandas as pd
stu = pd.read_csv('students.csv')
stu.head()

Unnamed: 0,studentID,firstName,lastName,birthdate,Points
0,1,Amy,Willis,10/23/1991,18.032651
1,2,Donald,Pierce,4/7/1990,79.671554
2,3,Adam,Holmes,5/16/1991,10.495381
3,4,Patrick,Payne,12/29/1990,33.449285
4,5,Chris,Lynch,10/3/1990,33.654615


#### Now, set the index column of `stu` to the studentID column
>- Note: make sure to make this change in-place
>- There are several ways to do this. Below are a couple of options

1. Use the `index_col` option when reading in the file
2. Look up the `set_index` method and apply it to `stu`
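
The two options above can be sketched as follows. This is an illustration only: the in-memory CSV text stands in for `students.csv` and its values are made up.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for students.csv
csv_text = "studentID,firstName,Points\n1,Amy,18.0\n2,Donald,79.5\n3,Adam,10.5\n"

# Option 1: choose the index column while reading the file
by_read = pd.read_csv(io.StringIO(csv_text), index_col="studentID")

# Option 2: read first, then move the column into the index in place
by_method = pd.read_csv(io.StringIO(csv_text))
by_method.set_index("studentID", inplace=True)

print(by_read.head())
print(by_method.index.name)
```

Without `inplace=True`, `set_index` returns a new DataFrame and leaves the original unchanged, so the result would need to be assigned back to a name.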

#### Look at the first five records after you have set the index

#### Show the last five records of `stu`

#### Show a tuple of the number of rows and columns in `stu`

In [14]:
stu.shape
# Print a sentence with an f-string, similar to "There are N rows and M columns."
stu.shape[0]

100
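
One possible way to build that sentence is sketched below on a made-up three-row DataFrame (the name `toy` and its values are assumptions, not the `stu` data). `shape` is a `(rows, columns)` tuple, so it can be unpacked directly.

```python
import pandas as pd

# Hypothetical toy data standing in for stu
toy = pd.DataFrame({"firstName": ["Amy", "Donald", "Adam"], "Points": [18.0, 79.5, 10.5]})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
message = f"There are {n_rows} rows and {n_cols} columns."
print(message)  # There are 3 rows and 2 columns.

# len() on a DataFrame also gives the number of rows
print(len(toy))  # 3
```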

#### Show the number of rows another way

#### Show the columns in `stu`

#### Show the datatypes that are in `stu`

# Section 2
## Accessing Data using `pandas`

#### Access all the records of the `firstName` Column Only
>- Try doing this three different ways
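
A hedged sketch of three common ways to pull out one column, shown on a made-up DataFrame named `toy` rather than on `stu`:

```python
import pandas as pd

# Hypothetical toy data standing in for stu
toy = pd.DataFrame({"firstName": ["Amy", "Donald", "Adam"], "Points": [18.0, 79.5, 10.5]})

by_bracket = toy["firstName"]         # bracket notation with the column label
by_attribute = toy.firstName          # attribute access; needs a valid Python identifier
by_loc = toy.loc[:, "firstName"]      # loc: all rows, one column label

print(by_bracket.equals(by_attribute) and by_bracket.equals(by_loc))
```

Each form returns the same Series. Attribute access does not work for column names that contain spaces or that clash with DataFrame method names.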

#### Show all the data for rows 3 to 10

#### Show all the data in the last row

#### Show the first 3 records of the third column
>- Try doing this in three different ways

#### Show the last 3 records of the first and last columns
>- Try doing this in three different ways

# Section 3
## More practice and notes on selecting data
>- `iloc` - integer position based selection
>- `loc` - label based selection

### Additional Source for DataFrame Methods

[Docu DF Methods](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

## `.iloc[arg1, arg2]`

accesses data by row and column  (i stands for integer)

`arg1`: row number

`arg2`: column number

first row has index 0 <br>
first column has index 0 <br>
index column is always shown (not included in column count)
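
The positional rules above can be illustrated on a small made-up DataFrame. The name `toy` and its values are assumptions for this sketch only.

```python
import pandas as pd

# Hypothetical toy data standing in for stu
toy = pd.DataFrame(
    {
        "firstName": ["Amy", "Donald", "Adam", "Patrick", "Chris"],
        "lastName": ["Willis", "Pierce", "Holmes", "Payne", "Lynch"],
        "Points": [18.0, 79.5, 10.5, 33.5, 33.75],
    }
)

# Rows 0-1 and columns 0-1; the end of an iloc slice is exclusive
first_two = toy.iloc[0:2, 0:2]

# Negative positions count from the end: the last row as a Series
last_row = toy.iloc[-1]

# Lists pick specific positions: rows 0 and 2, columns 0 and 2
picked = toy.iloc[[0, 2], [0, 2]]

print(first_two)
print(picked)
```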


### First, let's remind ourselves what the `stu` DataFrame looks like
>- Show the first five records of the `stu` DataFrame

In [16]:
#stu.head()
stu[0:5]

Unnamed: 0,studentID,firstName,lastName,birthdate,Points
0,1,Amy,Willis,10/23/1991,18.032651
1,2,Donald,Pierce,4/7/1990,79.671554
2,3,Adam,Holmes,5/16/1991,10.495381
3,4,Patrick,Payne,12/29/1990,33.449285
4,5,Chris,Lynch,10/3/1990,33.654615


### Using `iloc[]`, show the first ten records of the first 2 columns of `stu`

In [22]:
stu.iloc[0:10, 0:2]

Unnamed: 0,studentID,firstName
0,1,Amy
1,2,Donald
2,3,Adam
3,4,Patrick
4,5,Chris
5,6,Clarence
6,7,James
7,8,Barbara
8,9,Louis
9,10,Dennis


### Using `iloc[]` show rows 2-10 for the last two columns of `stu`

In [29]:
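# Note: without inplace=True or reassignment, set_index returns a new DataFrame and leaves stu unchanged
stu.set_index('studentID')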
stu.iloc[2:11, -2:]

Unnamed: 0,birthdate,Points
2,5/16/1991,10.495381
3,12/29/1990,33.449285
4,10/3/1990,33.654615
5,4/29/1988,79.655349
6,10/17/1989,8.996545
7,12/5/1991,60.141281
8,12/13/1990,58.775231
9,12/7/1990,29.948968
10,6/25/1988,8.550434


### Using `iloc[]` show rows 85 to the end of the DataFrame for all columns

In [31]:
stu.iloc[85:, :]

Unnamed: 0,studentID,firstName,lastName,birthdate,Points
85,86,Charles,Martinez,4/7/1988,83.86649
86,87,Anthony,Baker,8/14/1989,68.802021
87,88,Matthew,Sims,8/14/1990,78.932542
88,89,Robin,Collins,9/9/1991,1.619006
89,90,Ann,Powell,2/27/1988,54.369955
90,91,Earl,Robertson,1/26/1989,45.399924
91,92,Helen,Morrison,5/13/1988,90.407461
92,93,David,Dean,9/28/1989,78.979862
93,94,Alice,Fox,8/14/1989,27.961554
94,95,Donald,Nichols,5/17/1988,71.40695


### Using `iloc[]`, show columns 2 to 4 (lastName to Points) for all records

In [34]:
stu.iloc[: ,2:5]

Unnamed: 0,lastName,birthdate,Points
0,Willis,10/23/1991,18.032651
1,Pierce,4/7/1990,79.671554
2,Holmes,5/16/1991,10.495381
3,Payne,12/29/1990,33.449285
4,Lynch,10/3/1990,33.654615
...,...,...,...
95,Cunningham,11/5/1991,84.951415
96,Lynch,3/18/1990,4.377005
97,Little,12/23/1988,52.021577
98,Taylor,12/6/1989,13.154354


### Using `iloc[]`, show rows 20 to 25, columns 1 to 2 (firstName and lastName)

### Using `iloc[]`, show rows 1,5,7 and columns 1,4 (firstName and Points)

## `.loc`, label-based selection

>- Use index values to show specific rows
>- Use column labels to show specific columns
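
A minimal sketch of label-based selection, assuming a made-up DataFrame `toy` whose index holds student IDs. Unlike `iloc`, a `loc` slice includes its end label.

```python
import pandas as pd

# Hypothetical toy data with studentID as the index
toy = pd.DataFrame(
    {
        "lastName": ["Willis", "Pierce", "Holmes", "Payne", "Lynch"],
        "Points": [18.0, 79.5, 10.5, 33.5, 33.75],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="studentID"),
)

# Label slice: student IDs 2 through 4, with 4 included
two_to_four = toy.loc[2:4]

# Label slice for rows plus a single column label returns a Series
points_only = toy.loc[2:4, "Points"]

# Lists of labels for both rows and columns
selected = toy.loc[[1, 5], ["lastName", "Points"]]

print(two_to_four)
print(selected)
```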

### Using `loc` show students with student number 75-80

In [38]:
stu.head()
stu.set_index('studentID',inplace=True)
stu.loc[75:80]

studentID,firstName,lastName,birthdate,Points
75,Norma,Lynch,12/2/1991,46.269019
76,Jason,Freeman,6/1/1989,34.684327
77,Thomas,Lawrence,3/15/1991,90.077218
78,Daniel,Wallace,3/25/1991,82.183959
79,Charles,Austin,10/6/1990,72.17329
80,Brandon,Clark,11/8/1990,99.585281


### Using `loc` show the `birthdate` for student IDs 10-15

### Using `loc`, show the birthdate and points for student ids between 3 and 8

### Using `loc`, show the last name and points for student ids of 1, 50, and 100

# Section 4
## Conditional Selection with `loc`
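
Conditional selection passes a boolean Series (a mask) to `loc`, which keeps the rows where the mask is `True`. The sketch below uses a made-up DataFrame `toy`, and treating "between" as inclusive of both ends is an assumption.

```python
import pandas as pd

# Hypothetical toy data with studentID as the index
toy = pd.DataFrame(
    {
        "lastName": ["Willis", "Pierce", "Holmes", "Clark"],
        "Points": [18.0, 55.0, 5.0, 99.5],
    },
    index=pd.Index([1, 2, 3, 4], name="studentID"),
)

# A single condition
high = toy.loc[toy["Points"] > 90]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
middle = toy.loc[(toy["Points"] >= 50) & (toy["Points"] <= 60)]
extremes = toy.loc[(toy["Points"] < 10) | (toy["Points"] > 90)]

# Equality test on a text column
holmes = toy.loc[toy["lastName"] == "Holmes"]

print(extremes)
```

The Python keywords `and` and `or` do not work element-wise on a Series, which is why the `&` and `|` operators are used here.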

### Using `loc` show the students with point values greater than 90

### Using `loc` show students with points between 50 and 60

### Show students with points either less than 10 or greater than 90

### Find students with last name 'Holmes'

### Find students whose birth date is in 1989
>- Hint: apply `.str.endswith()` when accessing the `birthdate` column
>>- Check the data type of the birthdate field to understand why we can use a string method on a date
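
The `.str` accessor applies string methods to every value in a text column and returns a boolean mask that `loc` can use. The sketch below assumes a made-up DataFrame `toy` in which `birthdate` is stored as text, as the hint suggests.

```python
import pandas as pd

# Hypothetical toy data; birthdate is stored as text, not as a datetime
toy = pd.DataFrame(
    {
        "lastName": ["Collins", "Baker", "Dean", "Clark"],
        "birthdate": ["9/9/1991", "8/14/1989", "9/28/1989", "11/8/1990"],
        "Points": [1.6, 85.0, 79.0, 99.6],
    }
)

# Text test on the end of the date string
born_1989 = toy.loc[toy["birthdate"].str.endswith("1989")]

# String mask combined with a numeric condition
c_and_90 = toy.loc[toy["lastName"].str.startswith("C") & (toy["Points"] >= 90)]

# Two string masks joined with |, then combined with a numeric condition
starts_c_or_b = toy["lastName"].str.startswith("C") | toy["lastName"].str.startswith("B")
c_or_b_80 = toy.loc[starts_c_or_b & (toy["Points"] >= 80)]

print(born_1989)
print(c_or_b_80)
```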

### Find students whose first name starts with 'A'

### Find students whose `lastName` starts with 'C' that have 90 or more `Points`

### Find students whose `lastName` starts with 'C' or 'B' with 80 or more `Points`