# `pandas` Part 1: this notebook is a first lesson on `pandas`
## The main objective of this tutorial is to introduce `pandas` and create some DataFrames
>- Pandas is one of, if not the, most popular modules for data analytics/science projects
>- We will pretty much be learning about pandas from here until the final 


# Learning Objectives
## By the end of this tutorial you will be able to:
1. Import the `pandas` module and give it an alias
2. Define a pandas DataFrame and Series
3. Create a pandas DataFrame from scratch
4. Create a pandas DataFrame by reading an Excel file
5. Create a pandas DataFrame by reading a csv file
6. Examine your DataFrames using the `shape` attribute and the `head()` method

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps to working with pandas:
1. `import pandas as pd`
>- Note the `as pd` is optional, but `pd` is the conventional alias for pandas and makes the code a bit easier to write
2. Create or load data into a pandas DataFrame or Series
>- In practice, you will likely load existing datasets more often than you create them, but we will learn both
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame, or `head(n)` to show the first `n`
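
The steps above can be sketched end to end in a few lines. This is a minimal illustration, not the lesson's dataset: the CSV text and the names `csv_text` and `scores` are made up, and `StringIO` stands in for a file on disk.

```python
from io import StringIO

import pandas as pd  # step 1: import with the conventional alias

# Steps 2-3: load data. The CSV text below is invented for illustration;
# with a real file you would pass its path, e.g. pd.read_csv('fileName.csv').
csv_text = StringIO("name,score\nAda,90\nGrace,85\nLinus,78\n")
scores = pd.read_csv(csv_text)

# Step 4: inspect the result
print(scores.shape)   # a (rows, columns) tuple
print(scores.head())  # the first 5 rows by default (here there are only 3)
```
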

# First, check your working directory
>- Your working directory is where you are "working"
>- In other words, where you are opening and saving files. 
>>- In this class, that means your Jupyter notebooks

In [1]:
import os
print(os.getcwd())

C:\Users\CUBS Customer\Desktop\Python\Week 7


In [2]:
os.listdir()

['.ipynb_checkpoints',
 'More_Dictionaries.ipynb',
 'Pandas_Part1_Student.ipynb',
 'Week7_ManipulatingStrings_studentHW.ipynb',
 'winemag-data-130k-v2.csv']

### Note: if your working directory holds a lot of files, you can run a loop to find the one you want

In [4]:
for file in os.listdir():
    # endswith() matches only the file extension, not '.csv' anywhere in the name
    if file.endswith('.csv'):
        print(file)

winemag-data-130k-v2.csv
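
As an alternative to looping over `os.listdir()`, the same search can be sketched with the standard library's `pathlib` module. This is only one possible approach; the names `working_dir` and `csv_files` are chosen for this example.

```python
from pathlib import Path

# Path.cwd() is the pathlib counterpart of os.getcwd()
working_dir = Path.cwd()

# glob('*.csv') yields only the entries whose names match the pattern
csv_files = sorted(p.name for p in working_dir.glob('*.csv'))
print(csv_files)
```

The result is an ordinary list of file names, so it can be passed along to later code instead of only being printed.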


# Step 1: Import pandas and give it an alias

In [9]:
import pandas as pd

# Step 2: Create a pandas `DataFrame`
## Definition: a `DataFrame` is a table
>- A `DataFrame` is much like an Excel table or a table in a SQL database
>- A `DataFrame` contains rows/records and columns

### Let's make a `DataFrame` in the next cell with the `DataFrame` function

In [10]:
# Keys become column names; each list holds that column's values
details = {
    "Name": ["Drew", "Kyle", "Spencer", "Andrew"],
    "Age": [25, 26, 25, 25],
    "job": ["DAS42", "EY", "BlueVector AI", "Jobless"],
}

pd.DataFrame(details)

|   | Name | Age | job |
|---|---|---|---|
| 0 | Drew | 25 | DAS42 |
| 1 | Kyle | 26 | EY |
| 2 | Spencer | 25 | BlueVector AI |
| 3 | Andrew | 25 | Jobless |


### Notes on the previous example:
1. We use the `pd.DataFrame()` constructor to create a DataFrame from scratch
2. Note we used dictionary syntax where the keys are the column names and the values are lists holding the entries for each column
3. The numbers in the far left column are autogenerated index values
>- These values uniquely identify every row/record in the DataFrame
>- We can specify our own index values with the `index` parameter after the dictionary
4. This is a common way of constructing a DataFrame from scratch
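
Note 3 mentions the `index` parameter. A minimal sketch of how it might be used is below; the labels `id1` through `id4` and the name `people` are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

details = {
    "Name": ["Drew", "Kyle", "Spencer", "Andrew"],
    "Age": [25, 26, 25, 25],
}

# index= replaces the autogenerated 0, 1, 2, ... labels with our own
# (the id labels are made up for this example)
people = pd.DataFrame(details, index=["id1", "id2", "id3", "id4"])
print(people)

# Rows can now be looked up by label with .loc
print(people.loc["id2", "Name"])
```

The list passed to `index` must have one label per row, otherwise pandas raises an error.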

### Make another `DataFrame` from string data
>- Suppose we are tracking a set of events along with their dates and costs
>- We can read delimited text held in a string into a DataFrame with `StringIO` and `pd.read_csv`

In [13]:
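from io import StringIO

# Semicolon-delimited text; lines carry no leading or trailing spaces
# so that column names and values are parsed cleanly
string_data = StringIO("""Date;Event;Cost
10/2/2011;Music;10000
11/2/2011;Poetry;12000
12/2/2011;Theatre;5000
13/2/2011;Comedy;8000
""")

# sep=';' tells read_csv that fields are separated by semicolons
df = pd.read_csv(string_data, sep=';')
df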

|   | Date | Event | Cost |
|---|---|---|---|
| 0 | 10/2/2011 | Music | 10000 |
| 1 | 11/2/2011 | Poetry | 12000 |
| 2 | 12/2/2011 | Theatre | 5000 |
| 3 | 13/2/2011 | Comedy | 8000 |


### Now add our own index values instead of the auto-generated numbers

In [14]:
df = df.set_index('Date')
df

| Date | Event | Cost |
|---|---|---|
| 10/2/2011 | Music | 10000 |
| 11/2/2011 | Poetry | 12000 |
| 12/2/2011 | Theatre | 5000 |
| 13/2/2011 | Comedy | 8000 |


# Step 2 (part b) with `Series`
## Definition: a `Series` is a sequence of data values
>- Essentially, a `Series` can be thought of as a single column of a `DataFrame`
>>- And a `DataFrame` can be thought of as a group of `Series` placed side by side

### Let's make a `Series` or two in the next few cells

In [17]:
# Selecting a single column of a DataFrame returns a Series
event_column = df["Event"]
event_column

Date
10/2/2011      Music
11/2/2011     Poetry
12/2/2011    Theatre
13/2/2011     Comedy
Name: Event, dtype: object

In [20]:
import numpy as np

# A Series can also be built from a NumPy array
data = np.array(['C', 'o', 'l', 'o', 'r', 'a', 'd', 'o'])

ser = pd.Series(data)
ser

0    C
1    o
2    l
3    o
4    r
5    a
6    d
7    o
dtype: object
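
The relationship between a `Series` and a `DataFrame` can be checked directly. The sketch below uses a small made-up table (the names `events`, `cost`, `event`, and `rebuilt` are chosen for this example) to show that selecting one column yields a `Series`, and that placing `Series` side by side with `pd.concat` yields a `DataFrame` again.

```python
import pandas as pd

events = pd.DataFrame({"Event": ["Music", "Poetry"], "Cost": [10000, 12000]})

# Selecting one column returns a Series
cost = events["Cost"]
print(type(cost).__name__)

# A Series built on its own; name= becomes the column name later
event = pd.Series(["Music", "Poetry"], name="Event")

# axis=1 places the Series side by side as columns of a new DataFrame
rebuilt = pd.concat([event, cost], axis=1)
print(rebuilt)
```
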

# Step 2 (part c) Read Data Into a DataFrame
>- Knowing how to create your own data can be useful
>- However, most of the time we will read data into a DataFrame from a csv or Excel file

## File Needed: `winemag-data-130k-v2.csv`
>- Make sure you download this file from Canvas and place in your working directory

### Read the csv file with `pd.read_csv('fileName.csv')`

In [31]:
file = 'winemag-data-130k-v2.csv'
dat = pd.read_csv(file)

# tail(5) shows the last 5 rows; head() and shape are other ways to inspect the data
dat.tail(5)

|   | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 129966 | 129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel |  |  | Anna Lee C. Iijima |  | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
| 129967 | 129967 | US | Citation is given as much as a decade of bottl... |  | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
| 129968 | 129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace |  | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
| 129969 | 129969 | France | A dry style of Pinot Gris, this is crisp with ... |  | 90 | 32.0 | Alsace | Alsace |  | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
| 129970 | 129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace |  | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |


### Check how many rows/records and columns are in the `dat` DataFrame
>- Use `shape`

In [27]:
dat.shape

(129971, 14)

### The output returned by `shape` tells us how many rows and columns are in our DataFrame
>- Number of rows: 129,971
>- Number of columns: 14
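
Because `shape` is a plain tuple, it can be unpacked into two variables. The sketch below uses a tiny made-up DataFrame named `small` rather than the wine data.

```python
import pandas as pd

# Invented values, just to have something to measure
small = pd.DataFrame({"points": [87, 90, 92], "price": [15.0, 28.0, None]})

# shape is an attribute, not a method, so it takes no parentheses
n_rows, n_cols = small.shape
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the row count only
print(len(small))
```
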

### Now view the first 5 rows of data with `head()`

In [30]:
dat.head()

|   | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 |  | Sicily & Sardinia | Etna |  | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro |  |  | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... |  | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |


### Notice how it looks like we have two index columns
>- This is because the csv file already had an index column, but pandas did not automatically treat it as the index
>- Similar to how we set the index in the DataFrames we created, we can set the `index_col` parameter when we read in data

In [32]:
dat = pd.read_csv(file, index_col = 0)
dat.head(5)

|   | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 |  | Sicily & Sardinia | Etna |  | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro |  |  | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | US | Tart and snappy, the flavors of lime flesh and... |  | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore |  | Alexander Peartree |  | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
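
The effect of `index_col` can be reproduced on a tiny in-memory example. The CSV text below is invented and only assumes, as the `Unnamed: 0` column suggests, that the file's first column is a saved index with an empty header; the names `csv_text`, `plain`, and `indexed` are chosen for this sketch.

```python
from io import StringIO

import pandas as pd

# First column has an empty header, as happens when a DataFrame's
# index is written to a csv file (values invented for illustration)
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

plain = pd.read_csv(StringIO(csv_text))
indexed = pd.read_csv(StringIO(csv_text), index_col=0)

print(list(plain.columns))    # includes the placeholder name 'Unnamed: 0'
print(list(indexed.columns))  # only the real data columns remain
```

With `index_col=0`, the first column becomes the row labels, so it no longer counts toward the number of columns reported by `shape`.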
