# `pandas` Part 3: Descriptive Analytics with `pandas`

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Understand the three fundamental areas of an analytics project: Descriptive, Predictive, Prescriptive
2. Summarize data using `describe()`
>- Descriptive Analytics is the first layer of a full analytical report and `describe()` gets us started 
3. Transform data with simple calculations

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this CSV file from Canvas prior to the lesson

## The general steps to working with pandas:
1. `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- CSV files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame, or `head(10)` to show the first 10
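The steps above can be sketched in a few lines. This is a minimal example, not the lesson's actual code: the tiny CSV text below is made up and held in memory so the snippet runs without any file on disk, and the name `df` is just a placeholder.

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file on disk. With a real file you would pass
# its name (or full path) to pd.read_csv instead of this in-memory buffer.
csv_text = io.StringIO(
    "country,points,price\n"
    "Italy,87,15.0\n"
    "Portugal,87,\n"
    "US,90,65.0\n"
)

# Step 3: read the data into a DataFrame
df = pd.read_csv(csv_text)

# Step 4: check out the DataFrame
print(df.shape)     # (number of records, number of columns)
print(df.head())    # first 5 records by default
print(df.head(10))  # pass a number to ask for more (or fewer) records
```

Note that `shape` is an attribute (no parentheses) while `head()` is a method.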

Type-along narrations:

- Part 1: https://youtu.be/CuCcR_ZP8AM
- Part 2: https://youtu.be/fA0VCu5FQiE

# Analytics Project Framework Notes
## A complete and thorough analytics project will have 3 main areas
1. Descriptive Analytics: tells us what has happened or what is happening. 
>- The focus of this lesson is how to do this in Python.
>- Many companies operate at this level, but few go much beyond it
>- Descriptive statistics (mean, median, mode, frequencies)
>- Graphical analysis (bar charts, pie charts, histograms, box-plots, etc)
2. Predictive Analytics: tells us what is likely to happen next
>- Fewer companies are at this level, but more are slowly getting there
>- Predictive statistics ("machine learning (ML)" using regression, multi-way frequency analysis, etc)
>- Graphical analysis (scatter plots with regression lines, decision trees, etc)
3. Prescriptive Analytics: tells us what to do based on the analysis
>- Synthesis and Report writing: executive summaries, data-based decision making
>- No analysis is complete without a written report with at least an executive summary
>- Communicate results of analysis to both non-technical and technical audiences

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. import modules and check working directory
2. Read data in
3. Check the data

#### Note: storing our working directory in a variable named `path` will make accessing files in the directory easier
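One way to do step 1 is sketched below, assuming the standard-library `os` module is used to look up the working directory. The variable name `file_path` is an illustrative choice, not something defined in the lesson.

```python
import os
import pandas as pd

# Store the current working directory in a variable named `path`
path = os.getcwd()
print(path)

# Build a full file path from the directory and a file name.
# os.path.join inserts the right separator for your operating system.
file_path = os.path.join(path, "winemag-data-130k-v2.csv")
print(file_path)
```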

# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `winemag-data-130k-v2.csv`
>- Set the index to column 0
>- Note: by defining `path` above for our working directory we can then just concatenate our working directory with the file we wish to read in
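Here is a hedged sketch of this step. So that it runs anywhere, `path` points to a temporary folder holding a tiny made-up file with the same name. In the lesson, `path` would be your working directory and the file would be the real download. The sketch joins the pieces with `os.path.join`, which is one way to do the concatenation.

```python
import os
import tempfile
import pandas as pd

# Assumption for this sketch only: a temporary folder with a tiny fake file
path = tempfile.mkdtemp()
file_name = "winemag-data-130k-v2.csv"
with open(os.path.join(path, file_name), "w") as f:
    f.write(",country,points\n0,Italy,87\n1,Portugal,87\n2,US,90\n")

# index_col=0 tells pandas to use the first column of the file as the index
wine_reviews = pd.read_csv(os.path.join(path, file_name), index_col=0)
print(wine_reviews)
```

Without `index_col=0`, pandas would keep the first column as regular data and create a fresh 0, 1, 2, ... index of its own.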

### Check how many rows, columns, and data points are in the `wine_reviews` DataFrame
>- Use `shape` and indices to define variables
>- We can store the values for rows and columns in variables if we want to access them later
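A minimal sketch of pulling rows and columns out of `shape`, using a small made-up DataFrame in place of the real data. The variable names `rows`, `cols`, and `data_points` are illustrative.

```python
import pandas as pd

# Made-up stand-in for the wine_reviews DataFrame
wine_reviews = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "France"],
    "points": [87, 87, 90, 92],
    "price": [15.0, None, 65.0, 30.0],
})

# shape is a (rows, columns) tuple, so index 0 and index 1 pull out each value
rows = wine_reviews.shape[0]
cols = wine_reviews.shape[1]
data_points = rows * cols  # total cells, including any missing values

print(f"{rows} rows, {cols} columns, {data_points} data points")
```

The `size` attribute gives the same rows-times-columns total directly.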

### Check a couple of rows of data

# Descriptive Analytics with `describe()`
>- General syntax: `dataFrame.columnName.describe()`

### Now, what is/are the question(s) being asked of the data? 
>- All analytics projects start with questions (from you, your boss, some decision maker, etc)

#### For this example...
##### Question: What is the summary information about wine point ratings? 
>- subQ1: What is a baseline/average wine?
>>- What is the average rating?
>>- What is the median rating? 
>- subQ2: What is the range of wine ratings? 
>>- What is the lowest rating? The highest rating? 
>- subQ3: What rating is the lowest for the top 25% of wines (the 75th percentile)?

### The cool thing about learning Python and in particular `pandas` is you can answer all these with a few lines of code
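As a sketch of how `describe()` answers each sub-question, the example below uses eight made-up ratings rather than the real data. The labels `"mean"`, `"50%"`, `"min"`, `"max"`, and `"75%"` are the ones `describe()` produces for a numeric column.

```python
import pandas as pd

# Made-up ratings standing in for the real points column
wine_reviews = pd.DataFrame({"points": [80, 85, 87, 88, 90, 92, 95, 100]})

summary = wine_reviews.points.describe()
print(summary)

# Each statistic is labeled, so it can be looked up by name
print("subQ1 average:", summary["mean"])
print("subQ1 median:", summary["50%"])
print("subQ2 lowest and highest:", summary["min"], summary["max"])
print("subQ3 cutoff for the top 25%:", summary["75%"])
```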

### Notes on `describe()`
>- `describe()` is "type-aware" which means it will automatically give summary statistics based on the data type of the column
>- In the previous example, `describe()` gave us summary stats based on a numerical column
>- For a string column, we can't calculate a mean, median or standard deviation so we get different output from `describe()`

### Another question to be answered with analytics:
##### What information do we have regarding wine tasters? 
>- subQ1: How many total wine tasters are there? 
>- subQ2: How many total records have a wine taster mapped to them? 
>- subQ3: Who has the most wine tastings? 
>>- How many wine tastings does he or she have? 
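A minimal sketch of `describe()` on a string column. The rows are made up for illustration (including the placeholder names "Taster B" and "Taster C"), so the numbers are not the real answers.

```python
import pandas as pd

# Made-up rows; None marks a review with no taster recorded
wine_reviews = pd.DataFrame({
    "taster_name": ["Roger Voss", "Taster B", "Roger Voss",
                    None, "Taster C", "Roger Voss"],
})

taster_summary = wine_reviews.taster_name.describe()
print(taster_summary)

# The four labels returned for a string column
print(taster_summary["count"])   # records with a taster mapped to them
print(taster_summary["unique"])  # number of distinct tasters
print(taster_summary["top"])     # taster with the most records
print(taster_summary["freq"])    # number of records for that taster
```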

### Notes on the previous output:
>- count gives us the total number of records with non-null taster_name
>- unique gives us the number of distinct taster_name values
>- top gives us the taster_name with the most records
>- freq gives us the number of records for the top taster_name

# Getting specific summary stats and assigning a variable to them
>- To be able to write our results in a nice executive summary format, assign variables to specific summary stat values

### Assign variables for the mean points
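A short sketch with made-up ratings. The names `mean_points` and `mean_from_describe` are illustrative. It shows two routes to the same value, then uses the variable in a sentence the way an executive summary might.

```python
import pandas as pd

# Made-up ratings standing in for the real points column
wine_reviews = pd.DataFrame({"points": [85, 88, 90, 93]})

# Route 1: call mean() directly on the column
mean_points = wine_reviews.points.mean()

# Route 2: pull the "mean" entry out of the describe() output
mean_from_describe = wine_reviews.points.describe()["mean"]

# Use the stored value inside a sentence
print(f"The average wine rating is {mean_points:.1f} points.")
```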

### Create a list of the wine tasters with `unique()`
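A minimal sketch using made-up rows. `unique()` returns an array of the distinct values in order of first appearance. The sketch drops missing names first and wraps the result in `list()`, which is one way to end up with a plain Python list.

```python
import pandas as pd

# Made-up rows; None marks a review with no taster recorded
wine_reviews = pd.DataFrame({
    "taster_name": ["Roger Voss", "Taster B", "Roger Voss", None],
})

# All distinct values, which includes the missing value
print(wine_reviews.taster_name.unique())

# Distinct tasters only, as a plain Python list
tasters = list(wine_reviews.taster_name.dropna().unique())
print(tasters)
```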

### To see a list of wine tasters and how often they occur use `value_counts()`
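A sketch of `value_counts()` on made-up rows. The result is a Series indexed by taster name and sorted from most to fewest records. Missing names are left out by default.

```python
import pandas as pd

# Made-up rows; None marks a review with no taster recorded
wine_reviews = pd.DataFrame({
    "taster_name": ["Roger Voss", "Taster B", "Roger Voss", "Taster C",
                    "Taster B", "Roger Voss", None],
})

taster_counts = wine_reviews.taster_name.value_counts()
print(taster_counts)
```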

### Q: Which tasters have 10,000 or more reviews?
>- Filtering results using `where()`
>- Remove results not meeting the criteria with `dropna()`
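The two-step filter can be sketched as follows. The counts here are invented and typed in directly (in the lesson they would come from `value_counts()`), and the taster labels are placeholders.

```python
import pandas as pd

# Invented review counts per taster, for illustration only
taster_counts = pd.Series(
    {"Taster A": 25000, "Taster B": 12000, "Taster C": 9500, "Taster D": 400},
    name="count",
)

# where() keeps values that meet the condition and turns the rest into NaN
filtered = taster_counts.where(taster_counts >= 10000)
print(filtered)

# dropna() then removes the NaN entries
at_least_10k = filtered.dropna()
print(at_least_10k)
```

Because `where()` introduces NaN, the surviving counts are shown as floats. Boolean indexing, `taster_counts[taster_counts >= 10000]`, is an alternative that filters in one step.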

#### How many reviews did Roger Voss have?
>- Find the count for one particular reviewer using `loc`
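A minimal sketch with made-up rows, so the count printed here is not Roger Voss's real total. Since `value_counts()` is indexed by taster name, `loc` can look up one name directly.

```python
import pandas as pd

# Made-up rows for illustration
wine_reviews = pd.DataFrame({
    "taster_name": ["Roger Voss", "Taster B", "Roger Voss", "Taster C"],
})

taster_counts = wine_reviews.taster_name.value_counts()

# loc selects by label, and the labels here are the taster names
roger_count = taster_counts.loc["Roger Voss"]
print(roger_count)
```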

### Who are the top five wine tasters by number of occurrences? 
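One way to answer this is sketched below with six placeholder tasters. Because `value_counts()` sorts from most to fewest, `head(5)` on its output gives the top five.

```python
import pandas as pd

# Placeholder tasters with 6, 5, 4, 3, 2 and 1 reviews respectively
names = (["Taster A"] * 6 + ["Taster B"] * 5 + ["Taster C"] * 4
         + ["Taster D"] * 3 + ["Taster E"] * 2 + ["Taster F"])
wine_reviews = pd.DataFrame({"taster_name": names})

# value_counts() is already sorted in descending order
top_five = wine_reviews.taster_name.value_counts().head(5)
print(top_five)
```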

# Transforming data
>- Sometimes it is useful to standardize/normalize data
>- Standardizing data allows you to make comparisons regardless of the scale of the original data
>- We can transform data using some simple operations
>>- For more advanced transformations we can use `map()` and `apply()`

### Transforming the `points` column
>- In this example we will "remean" our points column to a mean of zero

##### Our new `points0` variable should have a mean of 0
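A minimal sketch of the transformation with made-up ratings: subtract the column mean from every value, then check that the new mean is zero (up to floating-point rounding on real data).

```python
import pandas as pd

# Made-up ratings standing in for the real points column
wine_reviews = pd.DataFrame({"points": [85, 88, 90, 93]})

# Subtracting a single number from a column applies it to every row
points0 = wine_reviews.points - wine_reviews.points.mean()
print(points0)

# Check: the mean of the shifted values should be 0
print(points0.mean())
```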

### Now assign a new column name, `points0`, to the data frame and insert the `points0` values
>- We will do this two different ways

#### Method1: This way creates and inserts the new column at the end of the DataFrame
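A sketch of Method 1 on a small made-up DataFrame: assigning to a column name that does not exist yet creates the column at the end.

```python
import pandas as pd

# Made-up stand-in for the wine_reviews DataFrame
wine_reviews = pd.DataFrame({
    "country": ["Italy", "US", "France", "Spain"],
    "points": [85, 88, 90, 93],
    "price": [15.0, 65.0, 30.0, 20.0],
})

points0 = wine_reviews.points - wine_reviews.points.mean()

# Assigning to a new column name adds it as the last column
wine_reviews["points0"] = points0
print(wine_reviews.head())
```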

#### Method2: Using `insert()` allows us to specify the position of our new column
>- Insert general syntax and parameters: `insert(loc, column, value, allow_duplicates)`
>>- `loc` (insertion index): the position where you want your column in your DataFrame
>>- `column` (column name): the name of your new column
>>- `value` (values): the values you want stored in your new column
>>- `allow_duplicates`: set to `True` to allow a column name that already exists in the DataFrame (this is about duplicate column names, not duplicate values)
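A sketch of Method 2 on a small made-up DataFrame. Position `2` is an arbitrary choice for illustration. It places the new column right after `points`, because positions start at 0.

```python
import pandas as pd

# Made-up stand-in for the wine_reviews DataFrame
wine_reviews = pd.DataFrame({
    "country": ["Italy", "US", "France", "Spain"],
    "points": [85, 88, 90, 93],
    "price": [15.0, 65.0, 30.0, 20.0],
})

points0 = wine_reviews.points - wine_reviews.points.mean()

# insert() changes the DataFrame in place and returns None,
# so do not assign its result back to wine_reviews
wine_reviews.insert(2, "points0", points0)
print(wine_reviews.head())
```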

### We can also concatenate fields and store that in a new column of our DataFrame
#### Task: Combine the country and province fields into one field separated with a ' - '
>- Insert the concatenated field into the `wine_reviews` DataFrame as `countryProv`
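A minimal sketch of the task, using the `wine_reviews` name from earlier in the lesson and three made-up rows. The `+` operator joins string columns row by row, and a plain string such as `" - "` is applied to every row.

```python
import pandas as pd

# Made-up stand-in for the wine_reviews DataFrame
wine_reviews = pd.DataFrame({
    "country": ["Italy", "US", "France"],
    "province": ["Tuscany", "Oregon", "Alsace"],
})

# Join the two string columns with ' - ' between them
wine_reviews["countryProv"] = wine_reviews.country + " - " + wine_reviews.province
print(wine_reviews)
```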