# `pandas` Part 4: Grouping and Sorting

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Group data with `groupby()`
2. Sort data with `sort_values()`
 

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps for working with pandas:
1. `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first few records in your DataFrame (5 by default; pass a number, e.g. `head(10)`, to see more)
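
The four steps above can be sketched end to end as follows. This is a minimal illustration only: the three-row CSV text is a made-up stand-in for a file on disk, and the names `csv_text` and `sample` are hypothetical.

```python
import io
import pandas as pd  # Step 1

# Hypothetical three-row stand-in for a CSV file on disk.
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,87,14.0
"""

# Steps 2-3: read_csv accepts a file path or a file-like object.
# index_col=0 uses the first column as the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Step 4: check the size and look at the first records.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
print(sample.head(2))
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file name or full path.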

# Analytics Project Framework Notes
## A complete and thorough analytics project will have 3 main areas
1. Descriptive Analytics: tells us what has happened or what is happening. 
>- The focus of this lesson is how to do this in Python.
>- Many companies operate at this level but go little further
>- Descriptive statistics (mean, median, mode, frequencies)
>- Graphical analysis (bar charts, pie charts, histograms, box-plots, etc)
2. Predictive Analytics: tells us what is likely to happen next
>- Fewer companies are at this level but are slowly getting there
>- Predictive statistics ("machine learning (ML)" using regression, multi-way frequency analysis, etc)
>- Graphical analysis (scatter plots with regression lines, decision trees, etc)
3. Prescriptive Analytics: tells us what to do based on the analysis
>- Synthesis and Report writing: executive summaries, data-based decision making
>- No analysis is complete without a written report with at least an executive summary
>- Communicate results of analysis to both non-technical and technical audiences

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. import modules and check working directory
2. Read data in
3. Check the data

In [2]:
import os, pandas as pd

# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `winemag-data-130k-v2.csv`
>- Set the index to column 0

In [3]:
wineReviews = pd.read_csv("winemag-data-130k-v2.csv", index_col = 0)

### Check how many rows, columns, and data points are in the `wineReviews` DataFrame
>- Use `shape` and indices to define variables
>- We can store the values for rows and columns in variables if we want to access them later

In [4]:
rows = wineReviews.shape[0]
columns = wineReviews.shape[1]
totCells = rows * columns

print(f'''The wine dataset has:
        {rows} rows,
        {columns} columns,
        {totCells} total Entries
        ''')

The wine dataset has:
        129971 rows,
        13 columns,
        1689623 total Entries
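
As a side note, `shape` is a `(rows, columns)` tuple, so it can be unpacked in one step, and the `size` attribute gives the total cell count directly. A small sketch on a made-up DataFrame (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical two-row DataFrame used only for illustration.
toy = pd.DataFrame({
    "points": [87, 90],
    "price": [15.0, 30.0],
    "country": ["US", "Italy"],
})

rows, columns = toy.shape   # unpack the (rows, columns) tuple
tot_cells = toy.size        # same value as rows * columns

print(rows, columns, tot_cells)
```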
        


### Check a couple of rows of data

In [5]:
wineReviews.head(2)

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |


# Descriptive Analytics with `groupby()`
>- General syntax: `dataFrame.groupby(['fieldsToGroupBy']).fieldToAnalyze.aggregation()`
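
To make the general syntax concrete, here is a minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical). It also shows that the column to analyze can be selected with brackets instead of attribute access, which is needed when a column name contains spaces or clashes with a method name.

```python
import pandas as pd

# Hypothetical data: five wines with a rating and a price.
toy = pd.DataFrame({
    "points": [87, 87, 90, 90, 90],
    "price": [15.0, 14.0, 30.0, None, 65.0],
})

# groupby -> pick a column -> aggregate
min_attr = toy.groupby(["points"]).price.min()       # attribute access
min_bracket = toy.groupby(["points"])["price"].min() # bracket access

print(min_bracket)
```

Both forms return the same Series, indexed by the grouping field. Missing prices are skipped by `min()`.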

### Now, what questions are being asked of the data?
>- All analytics projects start with questions (from you, your boss, some decision maker, etc)

### How many wines have been rated at each point value?

In [6]:
wineReviews.groupby(['points']).points.count()

points
80       397
81       692
82      1836
83      3025
84      6480
85      9530
86     12600
87     16933
88     17207
89     12226
90     15410
91     11359
92      9613
93      6489
94      3758
95      1535
96       523
97       229
98        77
99        33
100       19
Name: points, dtype: int64
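
Counting rows per value of a single column can also be done with `value_counts()`. The sketch below, on made-up data, suggests the two approaches give the same counts once the `value_counts()` result is sorted by its index (`toy` is a hypothetical name).

```python
import pandas as pd

# Hypothetical ratings for six wines.
toy = pd.DataFrame({"points": [88, 87, 88, 90, 87, 88]})

by_group = toy.groupby(["points"]).points.count()
by_value_counts = toy["points"].value_counts().sort_index()

print(by_group)
```

Note that `count()` only counts non-missing values in the selected column, which does not matter here because `points` has no missing entries.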

### How much does the least expensive wine for each point rating cost? 

In [9]:
minPrice = wineReviews.groupby(['points']).price.min()

minPrice

points
80      5.0
81      5.0
82      4.0
83      4.0
84      4.0
85      4.0
86      4.0
87      5.0
88      6.0
89      7.0
90      8.0
91      7.0
92     11.0
93     12.0
94     13.0
95     20.0
96     20.0
97     35.0
98     50.0
99     44.0
100    80.0
Name: price, dtype: float64

### How much does the most expensive wine for each point rating cost?

In [10]:
maxPrice = wineReviews.groupby(['points']).price.max()

maxPrice

points
80       69.0
81      130.0
82      150.0
83      225.0
84      225.0
85      320.0
86      170.0
87      800.0
88     3300.0
89      500.0
90      510.0
91     2013.0
92      750.0
93      770.0
94     1125.0
95      973.0
96     2500.0
97     2000.0
98     1900.0
99      850.0
100    1500.0
Name: price, dtype: float64

### What is the overall maximum price for all wines?

In [11]:
max(maxPrice)

3300.0

### What is the lowest price for a wine rating of 100?

In [15]:
minPrice[100]

80.0

### What is the highest price for a wine rating of 80? 

In [16]:
maxPrice[80]

69.0
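
The lookups above work because the grouped result is a Series indexed by `points`. A small sketch of the same ideas using Series methods: `max()` in place of the built-in `max`, `idxmax()` to find which rating holds the maximum, and `.loc` for explicit label-based lookup. The three values are copied from the `maxPrice` output above; the variable names are hypothetical.

```python
import pandas as pd

# Three rows copied from the maxPrice output above.
max_price = pd.Series(
    [69.0, 3300.0, 1500.0],
    index=pd.Index([80, 88, 100], name="points"),
    name="price",
)

overall_max = max_price.max()      # same value as max(max_price)
rating_of_max = max_price.idxmax() # index label where the maximum occurs
price_at_80 = max_price.loc[80]    # explicit label lookup

print(overall_max, rating_of_max, price_at_80)
```

Using `.loc` makes it clear that `80` is an index label (a rating), not a position.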

### What is the maximum rating for each country? 

In [17]:
countryMax = wineReviews.groupby(['country']).points.max()

countryMax

country
Argentina                  97
Armenia                    88
Australia                 100
Austria                    98
Bosnia and Herzegovina     88
Brazil                     89
Bulgaria                   91
Canada                     94
Chile                      95
China                      89
Croatia                    91
Cyprus                     89
Czech Republic             89
Egypt                      84
England                    95
France                    100
Georgia                    92
Germany                    98
Greece                     93
Hungary                    97
India                      93
Israel                     94
Italy                     100
Lebanon                    91
Luxembourg                 90
Macedonia                  89
Mexico                     92
Moldova                    91
Morocco                    93
New Zealand                95
Peru                       86
Portugal                  100
Romania                    92
Se

### What is the maximum rating for China?

In [18]:
countryMax['China']

89

##### Another way to get the maximum rating for China, combining `where` and `groupby`

In [20]:
wineReviews.where(wineReviews['country']=='China').groupby(['country']).points.max()

country
China    89.0
Name: points, dtype: float64
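
Notice that the result above is `89.0` rather than `89`. `where()` keeps every row but replaces the non-matching ones with `NaN`, which turns the integer `points` column into floats. A boolean filter with `.loc` drops the non-matching rows instead, so the integer type is preserved. A sketch on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data: one Chinese wine and two Chilean wines.
toy = pd.DataFrame({
    "country": ["China", "Chile", "Chile"],
    "points": [89, 86, 95],
})

# where(): non-matching rows become NaN, so points becomes float.
via_where = toy.where(toy["country"] == "China").groupby(["country"]).points.max()

# Boolean filter: non-matching rows are dropped, points stays integer.
via_mask = toy.loc[toy["country"] == "China", "points"].max()

print(via_where)
print(via_mask)
```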

### What are some summary stats for price for each country? 
>- Using the `agg()` function for specific summary stats
>>- What is the sample size?
>>- What is the minimum?
>>- What is the maximum?
>>- What is the mean? (left out of the `agg()` list in the cell below; add `'mean'` to the list to include it)
>>- What is the median?
>>- What is the standard deviation? 

In [22]:
round(wineReviews.groupby(['country']).price.agg(['count','min','max','median','std']))

| country | count | min | max | median | std |
|---|---|---|---|---|---|
| Argentina | 3756 | 4.0 | 230.0 | 17.0 | 23.0 |
| Armenia | 2 | 14.0 | 15.0 | 14.0 | 1.0 |
| Australia | 2294 | 5.0 | 850.0 | 21.0 | 49.0 |
| Austria | 2799 | 7.0 | 1100.0 | 25.0 | 27.0 |
| Bosnia and Herzegovina | 2 | 12.0 | 13.0 | 12.0 | 1.0 |
| Brazil | 47 | 10.0 | 60.0 | 20.0 | 11.0 |
| Bulgaria | 141 | 8.0 | 100.0 | 13.0 | 10.0 |
| Canada | 254 | 12.0 | 120.0 | 30.0 | 20.0 |
| Chile | 4416 | 5.0 | 400.0 | 15.0 | 22.0 |
| China | 1 | 18.0 | 18.0 | 18.0 | NaN |
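
For reference, a sketch of an `agg()` call that covers all six statistics listed above, including the mean. The data is made up (`toy` is a hypothetical name), so the numbers are not from the wine dataset.

```python
import pandas as pd

# Hypothetical prices for three countries.
toy = pd.DataFrame({
    "country": ["Armenia", "Armenia", "China", "Brazil", "Brazil", "Brazil"],
    "price": [14.0, 15.0, 18.0, 10.0, 20.0, 60.0],
})

# One column per statistic named in the list.
stats = toy.groupby(["country"]).price.agg(
    ["count", "min", "max", "mean", "median", "std"]
)

print(stats)
```

A group with a single row has no sample standard deviation, which is why `std` shows `NaN` for a country with only one priced wine.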


In [24]:
countryAgg = wineReviews.groupby(['country']).price.describe()

countryAgg

| country | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Argentina | 3756.0 | 24.510117 | 23.430122 | 4.0 | 12.0 | 17.0 | 25.0 | 230.0 |
| Armenia | 2.0 | 14.5 | 0.707107 | 14.0 | 14.25 | 14.5 | 14.75 | 15.0 |
| Australia | 2294.0 | 35.437663 | 49.049458 | 5.0 | 15.0 | 21.0 | 38.0 | 850.0 |
| Austria | 2799.0 | 30.762772 | 27.224797 | 7.0 | 18.0 | 25.0 | 36.5 | 1100.0 |
| Bosnia and Herzegovina | 2.0 | 12.5 | 0.707107 | 12.0 | 12.25 | 12.5 | 12.75 | 13.0 |
| Brazil | 47.0 | 23.765957 | 11.053649 | 10.0 | 15.0 | 20.0 | 29.0 | 60.0 |
| Bulgaria | 141.0 | 14.64539 | 9.508744 | 8.0 | 10.0 | 13.0 | 16.0 | 100.0 |
| Canada | 254.0 | 35.712598 | 19.658148 | 12.0 | 21.0 | 30.0 | 40.75 | 120.0 |
| Chile | 4416.0 | 20.786458 | 21.929371 | 5.0 | 12.0 | 15.0 | 20.0 | 400.0 |
| China | 1.0 | 18.0 | NaN | 18.0 | 18.0 | 18.0 | 18.0 | 18.0 |


## What are the descriptive analytics for country and province? 
>- We can group by multiple fields by adding more column names to the list passed to `groupby()`

In [25]:
wineReviews.groupby(['country','province']).points.describe()

| country | province | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Argentina | Mendoza Province | 3264.0 | 86.826593 | 3.233570 | 80.0 | 84.00 | 87.0 | 89.00 | 97.0 |
| Argentina | Other | 536.0 | 86.001866 | 2.726470 | 80.0 | 84.00 | 86.0 | 88.00 | 95.0 |
| Armenia | Armenia | 2.0 | 87.500000 | 0.707107 | 87.0 | 87.25 | 87.5 | 87.75 | 88.0 |
| Australia | Australia Other | 245.0 | 85.518367 | 2.194598 | 80.0 | 84.00 | 85.0 | 87.00 | 93.0 |
| Australia | New South Wales | 85.0 | 87.694118 | 2.600474 | 82.0 | 86.00 | 88.0 | 90.00 | 94.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Uruguay | Juanico | 12.0 | 86.333333 | 3.498918 | 80.0 | 83.50 | 87.0 | 89.25 | 90.0 |
| Uruguay | Montevideo | 11.0 | 88.272727 | 2.493628 | 83.0 | 87.00 | 88.0 | 90.50 | 91.0 |
| Uruguay | Progreso | 11.0 | 86.818182 | 2.182576 | 82.0 | 86.50 | 87.0 | 88.00 | 90.0 |
| Uruguay | San Jose | 3.0 | 84.000000 | 2.645751 | 82.0 | 82.50 | 83.0 | 85.00 | 87.0 |
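
Grouping by two fields produces a result with a two-level index (a `MultiIndex`). The sketch below, on made-up data, shows one way to pull rows back out of such a result with `.loc`: a single label selects everything under the first level, and a tuple selects one exact combination. The name `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings for four wines.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Uruguay"],
    "province": ["Colorado", "Colorado", "Oregon", "Juanico"],
    "points": [85, 87, 90, 86],
})

by_region = toy.groupby(["country", "province"]).points.mean()

us_only = by_region.loc["US"]                # all provinces under "US"
colorado = by_region.loc[("US", "Colorado")] # one (country, province) pair

print(by_region)
print(colorado)
```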


## What are the descriptive price analytics for the US?
>- Chain `get_group()` onto `groupby()` to pull out the rows for a single group

In [26]:
wineReviews.groupby(['country']).get_group('US').price.describe()

count    54265.000000
mean        36.573464
std         27.088857
min          4.000000
25%         20.000000
50%         30.000000
75%         45.000000
max       2013.000000
Name: price, dtype: float64

## What are the summary wine rating stats for Colorado? 
>- Note that states are coded in this dataset under province

In [28]:
wineReviews.groupby(['country','province']).get_group(('US','Colorado')).points.describe()

count    68.000000
mean     86.117647
std       1.943450
min      80.000000
25%      85.000000
50%      86.000000
75%      87.000000
max      91.000000
Name: points, dtype: float64

In [29]:
wineReviews.groupby(['province']).get_group('Colorado').points.describe()

count    68.000000
mean     86.117647
std       1.943450
min      80.000000
25%      85.000000
50%      86.000000
75%      87.000000
max      91.000000
Name: points, dtype: float64
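
The two Colorado cells above agree because the province name `Colorado` only appears under one country in this dataset. Selecting one group with `get_group()` is also equivalent to filtering the rows with a boolean condition, as this sketch on made-up data suggests (`toy` and its values are hypothetical). As far as I know, recent pandas releases may emit a warning when a one-item list such as `['province']` is combined with a plain string in `get_group()`, so the sketch groups by the plain column name.

```python
import pandas as pd

# Hypothetical ratings for four wines.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Uruguay"],
    "province": ["Colorado", "Colorado", "Oregon", "Juanico"],
    "points": [85, 87, 90, 86],
})

# Option 1: group, then pull out one group.
via_group = toy.groupby("province").get_group("Colorado").points.describe()

# Option 2: filter the rows directly.
via_mask = toy.loc[toy["province"] == "Colorado", "points"].describe()

print(via_group)
```

The filter is usually the simpler choice when only one group is needed; `groupby()` pays off when the same summary is wanted for every group.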

# Sorting Results
>- Chain `sort_values()` onto the result
>- The default is ascending order

## What are the summary stats for points for each country?
>- Sort the results from lowest to highest mean points

In [30]:
wineReviews.groupby(['country']).points.describe().sort_values(by='mean')

| country | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Peru | 16.0 | 83.5625 | 1.860779 | 80.0 | 82.0 | 84.0 | 85.0 | 86.0 |
| Egypt | 1.0 | 84.0 | NaN | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 |
| Ukraine | 14.0 | 84.071429 | 1.59153 | 82.0 | 83.0 | 84.0 | 84.75 | 88.0 |
| Brazil | 52.0 | 84.673077 | 2.340782 | 80.0 | 83.0 | 85.0 | 86.0 | 89.0 |
| Mexico | 70.0 | 85.257143 | 2.722348 | 80.0 | 83.0 | 85.0 | 87.0 | 92.0 |
| Romania | 120.0 | 86.4 | 1.716945 | 82.0 | 85.0 | 86.0 | 87.25 | 92.0 |
| Chile | 4472.0 | 86.493515 | 2.692959 | 80.0 | 85.0 | 86.0 | 88.0 | 95.0 |
| Bosnia and Herzegovina | 2.0 | 86.5 | 2.12132 | 85.0 | 85.75 | 86.5 | 87.25 | 88.0 |
| Argentina | 3800.0 | 86.710263 | 3.179627 | 80.0 | 84.0 | 87.0 | 89.0 | 97.0 |
| Uruguay | 109.0 | 86.752294 | 2.687957 | 80.0 | 85.0 | 87.0 | 88.0 | 92.0 |


### To sort in descending order...
>- Pass `ascending=False` to `sort_values()`

In [32]:
wineReviews.groupby(['country']).points.describe().sort_values(by='mean',ascending = False)

| country | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| England | 74.0 | 91.581081 | 1.843216 | 89.0 | 90.0 | 91.0 | 93.0 | 95.0 |
| India | 9.0 | 90.222222 | 1.715938 | 87.0 | 90.0 | 90.0 | 91.0 | 93.0 |
| Austria | 3345.0 | 90.101345 | 2.499799 | 82.0 | 88.0 | 90.0 | 92.0 | 98.0 |
| Germany | 2165.0 | 89.851732 | 2.469351 | 81.0 | 88.0 | 90.0 | 91.0 | 98.0 |
| Canada | 257.0 | 89.36965 | 2.384752 | 82.0 | 88.0 | 90.0 | 91.0 | 94.0 |
| Hungary | 146.0 | 89.191781 | 2.686659 | 81.0 | 88.0 | 89.0 | 90.0 | 97.0 |
| China | 1.0 | 89.0 | NaN | 89.0 | 89.0 | 89.0 | 89.0 | 89.0 |
| France | 22093.0 | 88.845109 | 3.044423 | 80.0 | 87.0 | 89.0 | 91.0 | 100.0 |
| Luxembourg | 6.0 | 88.666667 | 0.816497 | 88.0 | 88.0 | 88.5 | 89.0 | 90.0 |
| Australia | 2329.0 | 88.580507 | 2.9899 | 80.0 | 87.0 | 89.0 | 91.0 | 100.0 |
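
`sort_values()` also accepts a list of columns, with a matching list for `ascending`, which helps when one sort key alone is not enough. A small sketch on made-up data (`toy` and its values are hypothetical) covering ascending, descending, and a two-key sort:

```python
import pandas as pd

# Hypothetical ratings for five wines from three countries.
toy = pd.DataFrame({
    "country": ["Peru", "Peru", "England", "England", "India"],
    "points": [82, 84, 91, 93, 90],
})

summary = toy.groupby(["country"]).points.agg(["count", "mean"])

lowest_first = summary.sort_values(by="mean")
highest_first = summary.sort_values(by="mean", ascending=False)

# Largest groups first; within equal counts, lowest mean first.
two_keys = summary.sort_values(by=["count", "mean"], ascending=[False, True])

print(two_keys)
```

Chaining `.head(n)` after a descending sort is a simple way to get a "top n" table.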
