# `pandas` Part 3: Descriptive Analytics with `pandas`

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Understand the three fundamental areas of an analytics project: Descriptive, Predictive, Prescriptive
2. Summarize data using `describe()`
>- Descriptive Analytics is the first layer of a full analytical report and `describe()` gets us started 
3. Transform data with simple calculations

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps to working with pandas:
1. Import pandas: `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- CSV files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder as the notebook, you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame, or `head(n)` to show the first `n` (for example, `head(10)`)
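
The steps above can be sketched end to end. This is only a minimal illustration: it uses a small made-up CSV held in memory (through `io.StringIO`) in place of a real file, and the column names and values are invented for the example.

```python
import io
import pandas as pd

# Toy CSV text standing in for a file on disk (hypothetical data)
csv_text = """id,country,points
0,Italy,87
1,Portugal,87
2,US,90
"""

# Steps 2-3: load the data; read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Step 4: check the result
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first 5 rows by default
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the file name or full path.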

# Analytics Project Framework Notes
## A complete and thorough analytics project will have 3 main areas
1. Descriptive Analytics: tells us what has happened or what is happening. 
>- The focus of this lesson is how to do this in Python.
>- Many companies are at this level but not much more than this
>- Descriptive statistics (mean, median, mode, frequencies)
>- Graphical analysis (bar charts, pie charts, histograms, box-plots, etc)
2. Predictive Analytics: tells us what is likely to happen next
>- Fewer companies are at this level but are slowly getting there
>- Predictive statistics ("machine learning (ML)" using regression, multi-way frequency analysis, etc)
>- Graphical analysis (scatter plots with regression lines, decision trees, etc)
3. Prescriptive Analytics: tells us what to do based on the analysis
>- Synthesis and Report writing: executive summaries, data-based decision making
>- No analysis is complete without a written report with at least an executive summary
>- Communicate results of analysis to both non-technical and technical audiences

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. import modules and check working directory
2. Read data in
3. Check the data

In [1]:
import os, pandas as pd
os.getcwd()

'C:\\Users\\Cupcake\\Python'

#### Note: storing the working directory in a variable (for example, `path = os.getcwd()`) makes it easier to build full file paths later

# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `winemag-data-130k-v2.csv`
>- Set the index to column 0
>- Note: the file name alone works here because the file is in the working directory; if the working directory is stored in a variable such as `path`, you can instead join `path` with the name of the file you wish to read in

In [2]:
wineReviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
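
As a hedged sketch of the `path` idea mentioned in the notes, the working directory can be stored once and joined with the file name. The variable names `path` and `csv_path` are illustrative; `os.path.join` is used here instead of string concatenation so the separator is correct on any operating system.

```python
import os

# Store the working directory once so it can be reused
path = os.getcwd()

# Build the full path in a platform-independent way
csv_path = os.path.join(path, 'winemag-data-130k-v2.csv')
print(csv_path)

# The file could then be read with, for example:
# wineReviews = pd.read_csv(csv_path, index_col=0)
```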

### Check how many rows, columns, and data points are in the `wineReviews` DataFrame
>- Use `shape` and indices to define variables
>- We can store the values for rows and columns in variables if we want to access them later

In [6]:
rows = wineReviews.shape[0]
columns = wineReviews.shape[1]
totCells = rows * columns

print(f'''The wine dataset has:
        {rows} rows,
        {columns} columns,
        {totCells} total Entries
        ''')

The wine dataset has:
        129971 rows,
        13 columns,
        1689623 total Entries
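
The same three numbers can be obtained a little more compactly. The sketch below uses a tiny made-up DataFrame (not the wine data) to show tuple unpacking of `shape` and the `size` attribute, which equals rows times columns.

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({'country': ['Italy', 'US', None],
                   'points': [87, 90, 88]})

rows, columns = df.shape   # unpack the (rows, columns) tuple in one step
print(rows, columns)
print(df.size)             # total cells, counting missing values too
```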
        


### Check a couple of rows of data

In [7]:
wineReviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# Descriptive Analytics with `describe()`
>- General syntax: `dataFrame.columnName.describe()`

### Now, what is/are the question(s) being asked of the data? 
>- All analytics projects start with questions (from you, your boss, some decision maker, etc)

#### For this example...
##### Question: What is the summary information about wine point ratings? 
>- subQ1: What is a baseline/average wine?
>>- What is the average rating?
>>- What is the median rating? 
>- subQ2: What is the range of wine ratings? 
>>- What is the lowest rating? The highest rating? 
>- subQ3: What rating is the lowest for the top 25% of wines?

### The cool thing about learning `python` and in particular `pandas` is you can answer all these with a few lines of code

In [8]:
wineReviews.points.describe()

count    129971.000000
mean         88.447138
std           3.039730
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64

### Notes on `describe()`
>- `describe()` is "type-aware" which means it will automatically give summary statistics based on the data type of the column
>- In the previous example, `describe()` gave us summary stats based on a numerical column
>- For a string column, we can't calculate a mean, median or standard deviation so we get different output from `describe()`
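
To make the "type-aware" behavior concrete, here is a small sketch on made-up data (the names and values are placeholders, not taken from the wine dataset). The numeric column returns count, mean, std, min, quartiles and max, while the string column returns count, unique, top and freq.

```python
import pandas as pd

# Toy data: one numeric column and one string column with a missing value
toy = pd.DataFrame({
    'points': [87, 90, 88, 92],
    'taster_name': ['A', 'B', 'A', None],
})

numeric_summary = toy.points.describe()
text_summary = toy.taster_name.describe()

print(numeric_summary)
print(text_summary)
```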

### Another question to be answered with analytics:
##### What information do we have regarding wine tasters? 
>- subQ1: How many total wine tasters are there? 
>- subQ2: How many total records have a wine taster mapped to them? 
>- subQ3: Who has the most wine tastings? 
>>- How many wine tastings does he or she have? 

In [10]:
wineReviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object

### Notes on the previous output:
>- count gives us the total number of records with non-null taster_name
>- unique gives us the number of distinct (non-null) taster names
>- top gives us the taster_name with the most records
>- freq gives us the number of records for the top taster_name

# Getting specific summary stats and assigning a variable to them
>- To be able to write our results in a nice executive summary format, assign variables to specific summary stat values

### Assign variables for the mean points

In [11]:
meanPoints = wineReviews.points.mean()

meanPoints

88.44713820775404
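
The same pattern extends to the other summary values asked about earlier (median, range, and the cutoff for the top 25%). The sketch below uses a short made-up series of ratings and illustrative variable names; it shows one way the stored values could feed a sentence in an executive summary.

```python
import pandas as pd

# Toy ratings for illustration only
points = pd.Series([80, 86, 88, 91, 100], name='points')

mean_points = points.mean()
median_points = points.median()
min_points, max_points = points.min(), points.max()
top_quarter_cutoff = points.quantile(0.75)   # 75th percentile

print(f'Average rating: {mean_points:.1f}; median: {median_points:.0f}; '
      f'range: {min_points}-{max_points}; '
      f'the top 25% of wines score at least {top_quarter_cutoff:.0f}.')
```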

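### Create an array of the wine tasters with `unique()`
>- `unique()` returns a NumPy array and includes `nan` for missing names, so it shows 20 entries for the 19 distinct tasters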

In [12]:
tasters = wineReviews.taster_name.unique()

tasters

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)
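
Because `unique()` keeps the missing value, it can help to drop it first or to count distinct names directly. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy series with a repeated name and a missing value
names = pd.Series(['Roger Voss', 'Paul Gregutt', None, 'Roger Voss'])

print(names.unique())            # array that still includes the missing value
print(names.dropna().unique())   # missing value removed first
print(names.nunique())           # number of distinct non-missing names

taster_list = list(names.dropna().unique())   # plain Python list
print(taster_list)
```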

### To see a list of wine tasters and how often they occur use `value_counts()`

In [13]:
tasterCount = wineReviews.taster_name.value_counts()

tasterCount

Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
Matt Kettmann          6332
Joe Czerwinski         5147
Sean P. Sullivan       4966
Anna Lee C. Iijima     4415
Jim Gordon             4177
Anne Krebiehl MW       3685
Lauren Buzzeo          1835
Susan Kostrzewa        1085
Mike DeSimone           514
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: taster_name, dtype: int64

### Q: Which tasters have 10,000 or more reviews?
>- Filtering results using `where()`
>- Remove results not meeting the criteria with `dropna()`

In [20]:
# Use >= so that a taster with exactly 10,000 reviews is included
tasterCount.where(tasterCount >= 10000).dropna()

Roger Voss           25514.0
Michael Schachner    15134.0
Kerin O’Keefe        10776.0
Name: taster_name, dtype: float64
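
Note that the counts above became floats (`25514.0`): `where()` replaces non-matching entries with `NaN`, which forces a float dtype. An alternative, sketched here on a few made-up counts, is boolean indexing, which keeps only the matching rows and so preserves the integer dtype.

```python
import pandas as pd

# Toy counts for illustration
counts = pd.Series({'Roger Voss': 25514,
                    'Michael Schachner': 15134,
                    'Virginie Boone': 9537})

via_where = counts.where(counts >= 10000).dropna()   # values become floats
via_mask = counts[counts >= 10000]                   # values stay integers

print(via_where)
print(via_mask)
```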

#### How many reviews did Roger Voss have?
>- Find the count for one particular reviewer using `loc`

In [28]:
tasterCount.loc['Roger Voss']

25514

### Who are the top five wine tasters by number of occurrences? 

In [29]:
tasterCount[:5]

Roger Voss           25514
Michael Schachner    15134
Kerin O’Keefe        10776
Virginie Boone        9537
Paul Gregutt          9532
Name: taster_name, dtype: int64
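
Slicing with `[:5]` works because `value_counts()` sorts from most to least frequent by default. Two equivalent and slightly more explicit options are `head(n)` and `nlargest(n)`, sketched here on made-up data:

```python
import pandas as pd

# Toy data: 'A' appears 3 times, 'B' twice, 'C' once
names = pd.Series(['B', 'A', 'A', 'C', 'A', 'B'])
counts = names.value_counts()    # sorted in descending order by default

print(counts.head(2))       # first two rows of the sorted counts
print(counts.nlargest(2))   # two largest values, regardless of current order
```

`nlargest()` does not rely on the Series already being sorted, which makes it the safer choice if the counts have been reordered.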

# Transforming data
>- Sometimes it is useful to standardize/normalize data
>- Standardizing data allows you to make comparisons regardless of the scale of the original data
>- We can transform data using some simple operations
>>- For more advanced transformations we can use `map()` and `apply()`

### Transforming the `points` column
>- In this example we will center ("remean") our points column so that its mean is zero

In [30]:
# Recall: earlier we defined meanPoints on the original data

meanPoints

88.44713820775404

In [31]:
points0 = wineReviews.points - meanPoints

##### Our new `points0` variable should have a mean of 0 (the tiny nonzero value below is floating-point rounding error)

In [33]:
points0.mean()

-1.2830454158312965e-14
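
Centering is the first half of a full standardization (z-score), which also divides by the standard deviation. The sketch below uses made-up ratings and checks the results with `np.isclose`, since floating-point means are rarely exactly zero. Note that `Series.std()` uses the sample standard deviation by default.

```python
import numpy as np
import pandas as pd

# Toy ratings for illustration only
points = pd.Series([80, 86, 88, 91, 100], dtype='float64')

centered = points - points.mean()     # mean becomes (approximately) 0
z_scores = centered / points.std()    # standard deviation becomes 1

print(np.isclose(centered.mean(), 0.0))
print(np.isclose(z_scores.std(), 1.0))
```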

### Now assign a new column name, `points0`, to the data frame and insert the `points0` values
>- We will do this two different ways

#### Method1: This way creates and inserts the new column at the end of the DataFrame

In [35]:
wineReviews['points0'] = points0
wineReviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,points0
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.447138
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.447138
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,-1.447138
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,-1.447138
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,-1.447138


#### Method2: Using `insert()` allows us to specify the position of our new column
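>- Insert general syntax and parameters: `insert(loc, column, value, allow_duplicates)`
>>- `loc` (insertion index): the position where you want your column in your DataFrame
>>- `column` (column name): the name of your new column
>>- `value` (values): the values you want stored in your new column
>>- `allow_duplicates`: set to `True` to allow a column whose name already exists in the DataFrame; the default, `False`, raises an error instead
>>- Note: `points0` already exists from Method1, so `True` is needed here, and each run of the cell adds another `points0` column, which is likely why the output below shows more than one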

In [37]:
wineReviews.insert(4,'points0',points0,True)

In [38]:
wineReviews.head()

Unnamed: 0,country,description,designation,points,points0,points0.1,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,points0.2
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,-1.447138,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.447138
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,-1.447138,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.447138
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,-1.447138,-1.447138,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,-1.447138
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,-1.447138,-1.447138,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,-1.447138
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,-1.447138,-1.447138,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,-1.447138
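
To avoid accumulating duplicate columns, one option is to leave `allow_duplicates` at its default and compute the position from an existing column name. This is a sketch on a small made-up DataFrame; the variable `position` is illustrative.

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({'country': ['Italy', 'US'],
                   'points': [87, 90],
                   'price': [None, 14.0]})
points0 = df.points - df.points.mean()

# Place the new column immediately after 'points'
position = df.columns.get_loc('points') + 1
df.insert(position, 'points0', points0)
print(list(df.columns))

# A second insert with the same name is refused unless allow_duplicates=True
try:
    df.insert(position, 'points0', points0)
except ValueError as err:
    print('Refused:', err)
```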


### We can also concatenate fields and store that in a new column of our DataFrame
#### Task: Combine the country and province fields into one field separated with a ' - '
>- Insert the concatenated field into the `wineReviews` DataFrame as `countryProv`

In [40]:
countryProv = wineReviews.country + " - " + wineReviews.province

wineReviews['countryProv'] = countryProv

wineReviews.head()

Unnamed: 0,country,description,designation,points,points0,points0.1,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,points0.2,countryProv
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,-1.447138,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.447138,Italy - Sicily & Sardinia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,-1.447138,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.447138,Portugal - Douro
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,-1.447138,-1.447138,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,-1.447138,US - Oregon
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,-1.447138,-1.447138,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,-1.447138,US - Michigan
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,-1.447138,-1.447138,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,-1.447138,US - Oregon
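
One caveat with `+` concatenation: if either field is missing for a row, the combined value is missing too. The sketch below, on made-up data, contrasts this with `str.cat()`, whose `na_rep` argument substitutes a placeholder; the placeholder text `'Unknown'` is an arbitrary choice for the example.

```python
import pandas as pd

# Toy data with one row missing both fields
df = pd.DataFrame({'country': ['Italy', 'US', None],
                   'province': ['Sicily & Sardinia', 'Oregon', None]})

# Plain concatenation: a missing value propagates to the result
plain = df.country + ' - ' + df.province

# str.cat with na_rep: missing values are replaced before joining
filled = df.country.str.cat(df.province, sep=' - ', na_rep='Unknown')

print(plain)
print(filled)
```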
