# `pandas` Part 1: A First Lesson on `pandas`
## The main objective of this tutorial is to introduce `pandas` and create some DataFrames
>- `pandas` is one of the most popular modules, if not the most popular, for data analytics and data science projects
>- We will pretty much be learning about pandas from here until the final 


# Learning Objectives
## By the end of this tutorial you will be able to:
1. Import the `pandas` module and give it an alias
2. Define a pandas DataFrame and Series
3. Create a pandas DataFrame from scratch
4. Create a pandas DataFrame by reading an Excel file
5. Create a pandas DataFrame by reading a csv file
6. Examine your DataFrames using the `shape` attribute and the `head()` method

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps to working with pandas:
1. `import pandas as pd`
>- The `as pd` part is optional, but `pd` is the conventional alias for pandas and makes the code a bit easier to write
2. Create or load data into a pandas DataFrame or Series
>- In practice, you will likely be loading more datasets than creating but we will learn both
3. Read data with the `pd.read_` family of functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame, or `head(10)` to show the first 10

# First, check your working directory
>- Your working directory is where you are "working"
>- In other words, it is the folder where you are opening and saving files
>>- In this class, that means your Jupyter notebooks

In [1]:
import os

os.getcwd()

'C:\\Users\\Cupcake\\Python'

In [2]:
os.listdir()

['.ipynb_checkpoints',
 'BooleanExpressions_StudentNotes.ipynb',
 'ControlFlow_If_elseIf_StudentNotes.ipynb',
 'ControlFlow_Loops_Student.ipynb',
 'Dictionaries_Type-Along_student.ipynb',
 'Functions_Student.ipynb',
 'Gibson_Daniel_Midterm.ipynb',
 'Gibson_Daniel_Quiz2.ipynb',
 'Gibson_Daniel_Quiz3.ipynb',
 'Gibson_Daniel_Quiz4Backup.ipynb',
 'Gibson_Daniel_Quiz5.ipynb',
 'Gibson_Daniel_Quiz6.ipynb',
 'Gibson_Daniel_W5D1Warmup.ipynb',
 'Gibson_Daniel_W7D1-InClass-Dict&List_Review.ipynb',
 'Gibson_Daniel_W7D1-InClass-ListReview.ipynb',
 'Lists_Student_type-along.ipynb',
 'ManipulatingStrings_Student.ipynb',
 'Midterm Review Lab Questions_Student.ipynb',
 'pandas1&2_Read_Index_Select_Activity_student.ipynb',
 'Pandas_Part1_Student.ipynb',
 'Python_Functions_Homework.ipynb',
 'Quiz#3 - Intro Flow Control - Student.ipynb',
 'Quiz3DG.ipynb',
 'Quiz4.ipynb',
 'Quiz5.ipynb',
 'students.csv',
 'students100.xlsx',
 'Untitled.ipynb',
 'W3D1_Warmup.ipynb',
 'W6D1-Warmup.ipynb',
 'WaarmupPractice1

### Note: if you have a lot of files like I do, you might want to run a loop to find the one you want

In [7]:
for file in os.listdir():
    if file.startswith('wine'):  # same check as file.find('wine') == 0
        print(file)

winemag-data-130k-v2.csv


In [4]:
'winemag-data-130k-v2.csv' in os.listdir()

True
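
The same search can also be written with `pathlib`. This is only a sketch: the helper name `find_files` and the file names in the throwaway folder are made up for illustration and are not part of the lesson files.

```python
from pathlib import Path
import tempfile


def find_files(directory, prefix):
    """Return the sorted names of files in `directory` that start with `prefix`."""
    return sorted(
        p.name for p in Path(directory).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )


# Demo in a temporary folder so the example does not depend on your own files
with tempfile.TemporaryDirectory() as tmp:
    for name in ['wine_a.csv', 'wine_b.csv', 'students.csv']:  # hypothetical names
        (Path(tmp) / name).touch()
    matches = find_files(tmp, 'wine')
    print(matches)
```

In your own notebook you would call it on the working directory, for example `find_files(Path.cwd(), 'wine')`.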

# Step 1: Import pandas and give it an alias

In [8]:
import pandas as pd

# Step 2: Create a pandas `DataFrame`
## Definition: a `DataFrame` is a table
>- A `DataFrame` is essentially the same idea as an Excel table or a table in a SQL database
>- A `DataFrame` contains rows/records and columns

### Let's make a `DataFrame` in the next cell with the `pd.DataFrame()` constructor

In [9]:
pd.DataFrame({'Yes': [50,21], 'No': [131,2]})

|   | Yes | No |
|---|-----|-----|
| 0 | 50 | 131 |
| 1 | 21 | 2 |


### Notes on the previous example:
1. We use the `pd.DataFrame({})` constructor to create a DataFrame from scratch
2. Note we used dictionary syntax, where the keys are the column names ('Yes' and 'No') and the values are the lists of entries for each column
3. The numbers in the far left column are autogenerated index values
>- These values will uniquely identify every row/record in the DataFrame
>- We can specify our own index values with an index parameter after the dictionary
4. This is a very common way of constructing a DataFrame from scratch
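
To check the notes above for yourself, you can ask the DataFrame for its column names and index values. A minimal sketch using the same 'Yes'/'No' data; the variable name `df` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# Column names come from the dictionary keys
print(list(df.columns))

# Row labels are auto-generated, starting at 0, when no index is given
print(list(df.index))
```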

### Make another `DataFrame` with string data
>- Suppose we are collecting feedback on several products
>- We can store the data from various customers/reviewers with a DataFrame

In [10]:
pd.DataFrame({'Bob': ['I liked it', 'It was awful'],
             'Sue': ['Pretty good', 'Bland']})

|   | Bob | Sue |
|---|-----|-----|
| 0 | I liked it | Pretty good |
| 1 | It was awful | Bland |


### Now add our own index values instead of the auto-generated numbers

In [11]:
pd.DataFrame({'Bob': ['I liked it', 'It was awful'],
             'Sue': ['Pretty good', 'Bland']},
            index = ['Product A', 'Product B'])

|   | Bob | Sue |
|---|-----|-----|
| Product A | I liked it | Pretty good |
| Product B | It was awful | Bland |


# Step 2 (part b) with `Series`
## Definition: a `Series` is a sequence of data values
>- Essentially, a `Series` can be thought of as a single column of a `DataFrame`
>>- And a `DataFrame` can be thought of as a bunch of `Series` placed side by side

### Let's make a `Series` or two in the next few cells

In [12]:
pd.Series([1,2,3,4,5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [13]:
pd.Series([30, 35, 50],
          index = ['2015 Sales','2016 Sales','2017 Sales'])

2015 Sales    30
2016 Sales    35
2017 Sales    50
dtype: int64
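
The relationship between a `Series` and a `DataFrame` described in the definition above can be sketched as follows. The `costs` numbers are made up for illustration, and selecting a column with square brackets is used here only to show that one column of a DataFrame is a Series.

```python
import pandas as pd

years = ['2015', '2016', '2017']
sales = pd.Series([30, 35, 50], index=years)
costs = pd.Series([20, 25, 30], index=years)  # hypothetical values

# Two Series that share an index become the two columns of one DataFrame
report = pd.DataFrame({'Sales': sales, 'Costs': costs})
print(report)

# Pulling one column back out gives a Series again
column = report['Sales']
print(isinstance(column, pd.Series))
```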

# Step 2 (part c) Read Data Into a DataFrame
>- Knowing how to create your own data can be useful
>- However, most of the time we will read data into a DataFrame from a csv or Excel file

## File Needed: `winemag-data-130k-v2.csv`
>- Make sure you download this file from Canvas and place it in your working directory

### Read the csv file with `pd.read_csv('fileName.csv')`

In [14]:
wine_reviews = pd.read_csv('winemag-data-130k-v2.csv')

### Check how many rows/records and columns are in the `wine_reviews` DataFrame
>- Use `shape`

In [16]:
wine_reviews.shape

(129971, 14)

### The output returned by `shape` tells us how many rows and columns are in our DataFrame
>- Number of rows: 129,971
>- Number of columns: 14
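
Because `shape` is a plain tuple of (rows, columns), you can unpack it into two variables. A small sketch on a made-up three-row DataFrame rather than the wine data:

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns
df = pd.DataFrame({'Yes': [50, 21, 7], 'No': [131, 2, 9]})

# shape is an attribute, so there are no parentheses after it
n_rows, n_cols = df.shape
print(f'{n_rows} rows, {n_cols} columns')

# len() gives the number of rows only
print(len(df))
```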

### Now view the first 5 rows of data with `head()`

In [18]:
wine_reviews.head()

|   | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
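
`head()` returns the first rows in their original order, and you can pass a number to change how many you get back. A quick sketch on a made-up 10-row DataFrame (the column name `n` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'n': range(1, 11)})  # hypothetical 10-row DataFrame

first_five = df.head()    # default: the first 5 rows
first_three = df.head(3)  # ask for the first 3 rows instead

print(len(first_five), len(first_three))
```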


### Notice how it looks like we have two index columns
>- This is because the csv file already had an index column, but pandas did not automatically treat it as the index
>- Similar to how we set the index in the DataFrames we created, we can set the `index_col` parameter when we read in data

In [21]:
wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)  # 0 is the position of the first column, which holds the record id / key to use as the index

In [22]:
wine_reviews.head()

|   | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
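
If you want to see the effect of `index_col` without the full wine file, here is a small sketch that reads a made-up, two-record csv from a string. The csv text and the variable names are assumptions for illustration; `io.StringIO` simply lets `pd.read_csv` treat the string like a file.

```python
import io
import pandas as pd

# Hypothetical csv whose first column is an unnamed record id
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

without_index = pd.read_csv(io.StringIO(csv_text))
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, the id column is kept as a regular column
print(list(without_index.columns))

# With index_col=0, the id column becomes the index instead
print(list(with_index.columns))
print(list(with_index.index))
```

Dropping the extra column also changes `shape`: the version read with `index_col=0` has one fewer column.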
