# `pandas` Part 4: Grouping and Sorting

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Group data with `groupby()`
2. Sort data with `sort_values()`
 

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps to working with pandas:
1. Import the library: `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Read data with a `pd.read_*()` function
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- CSV files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame (pass a number, e.g. `head(10)`, to show more)
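
The steps above can be sketched end to end as follows. This is a minimal sketch, not the lesson's actual data: `csv_text` is a made-up, in-memory stand-in for a CSV file so that the example runs without any download. With a real file you would pass its name or full path to `pd.read_csv()` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (values are illustrative only)
csv_text = """id,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,87,14.0
"""

# Steps 2-3: load the data; index_col=0 uses the first column as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Step 4: check the result
n_rows, n_cols = df.shape   # shape is an attribute, so no parentheses
print(n_rows, n_cols)
print(df.head(2))           # head() shows 5 rows unless you pass a number
```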

# Analytics Project Framework Notes
## A complete and thorough analytics project will have 3 main areas
1. Descriptive Analytics: tells us what has happened or what is happening. 
>- The focus of this lesson is how to do this in Python.
>- Many companies operate at this level but have not moved much beyond it
>- Descriptive statistics (mean, median, mode, frequencies)
>- Graphical analysis (bar charts, pie charts, histograms, box-plots, etc)
2. Predictive Analytics: tells us what is likely to happen next
>- Fewer companies are at this level but are slowly getting there
>- Predictive statistics ("machine learning (ML)" using regression, multi-way frequency analysis, etc)
>- Graphical analysis (scatter plots with regression lines, decision trees, etc)
3. Prescriptive Analytics: tells us what to do based on the analysis
>- Synthesis and Report writing: executive summaries, data-based decision making
>- No analysis is complete without a written report with at least an executive summary
>- Communicate results of analysis to both non-technical and technical audiences

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. Import modules and check the working directory
2. Read data in
3. Check the data

In [13]:
import os
import pandas as pd

os.getcwd()
for file in os.listdir():
    if '.csv' in file:
        print(file)
    elif '.xlsx' in file:
        print(file)
        
file1 = 'winemag-data-130k-v2 (2).csv'
file2 = 'students (1).csv'
file3 = 'students100 (1).xlsx'
        
# haven't read in the students files yet


students (1).csv
students100 (1).xlsx
winemag-data-130k-v2 (2).csv


In [11]:
dat1 = pd.read_csv(file1, index_col=0)
dat1.head(3)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm


# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `winemag-data-130k-v2.csv`
>- Set the index to column 0 with `index_col=0`

### Check how many rows, columns, and data points are in the `dat1` DataFrame
>- Use `shape` and indices to define variables
>- We can store the values for rows and columns in variables if we want to access them later

In [14]:
dat1.shape

(129971, 13)
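
One way to store these values for later is sketched below on a small stand-in DataFrame. The name `toy` and its values are hypothetical, not the wine data. The idea is to index into the `shape` tuple and multiply the two numbers to get the count of data points.

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({'points': [87, 87, 90], 'price': [15.0, 14.0, None]})

n_rows = toy.shape[0]      # first element of the shape tuple: rows
n_cols = toy.shape[1]      # second element: columns
n_cells = n_rows * n_cols  # total cells, missing values included

print(n_rows, n_cols, n_cells)
print(toy.size)            # DataFrame.size reports the same cell count
```

Applied to `dat1`, the same arithmetic gives 129971 × 13 = 1,689,623 data points.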

### Check a couple of rows of data

In [16]:
dat1.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# Descriptive Analytics with `groupby()`
>- General syntax: `dataFrame.groupby(['fields to group by']).fieldToAnalyze.aggregation()`
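
The general pattern can be tried on a small example first. The sketch below assumes a hypothetical `toy` DataFrame with made-up values; only the `groupby()` calls mirror what is used on the wine data.

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({
    'country': ['US', 'US', 'Italy', 'Italy', 'Italy'],
    'points':  [87, 90, 88, 92, 85],
    'price':   [14.0, 65.0, None, 20.0, 30.0],
})

# dataFrame.groupby(field to group by).fieldToAnalyze.aggregation()
mean_points = toy.groupby('country').points.mean()
print(mean_points)

# Bracket selection is equivalent and also works for column names
# that are not valid Python identifiers (spaces, punctuation, etc.)
count_price = toy.groupby('country')['price'].count()  # count() skips missing values
print(count_price)
```

The result of each call is a Series whose index holds the group labels, which is why the outputs in this lesson show `points` or `country` as the row labels.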

### Now, what is/are the question(s) being asked of the data? 
>- All analytics projects start with questions (from you, your boss, some decision maker, etc)

###  How many wines have been rated at each point value?

In [17]:
dat1.groupby('points').points.count()

points
80       397
81       692
82      1836
83      3025
84      6480
85      9530
86     12600
87     16933
88     17207
89     12226
90     15410
91     11359
92      9613
93      6489
94      3758
95      1535
96       523
97       229
98        77
99        33
100       19
Name: points, dtype: int64

### How much does the least expensive wine for each point rating cost? 

In [22]:
min_price = dat1.groupby('points').price.min()
min_price

min_price = pd.DataFrame(min_price).reset_index()
min_price['ratio'] = min_price['price']/min_price['points']
min_price

# You can use aggregation functions such as .min() and .max() for the min_price calculations

Unnamed: 0,points,price,ratio
0,80,5.0,0.0625
1,81,5.0,0.061728
2,82,4.0,0.04878
3,83,4.0,0.048193
4,84,4.0,0.047619
5,85,4.0,0.047059
6,86,4.0,0.046512
7,87,5.0,0.057471
8,88,6.0,0.068182
9,89,7.0,0.078652


### Question: How much does the most expensive wine for each point rating cost?

In [25]:
max_price = dat1.groupby('points').price.max()
max_price

max_price = pd.DataFrame(max_price).reset_index()
max_price['ratio'] = max_price['price']/max_price['points']
max_price

Unnamed: 0,points,price,ratio
0,80,69.0,0.8625
1,81,130.0,1.604938
2,82,150.0,1.829268
3,83,225.0,2.710843
4,84,225.0,2.678571
5,85,320.0,3.764706
6,86,170.0,1.976744
7,87,800.0,9.195402
8,88,3300.0,37.5
9,89,500.0,5.617978


### What is the overall maximum price for all wines?

In [None]:
### Common aggregation functions include: count, first, last, mean, median

In [26]:
max_price.price.max()

3300.0
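
Taking the maximum of the per-rating maxima works because the largest group maximum is also the largest value overall. A small sketch of that reasoning, using a hypothetical `toy` DataFrame with made-up values:

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({
    'points': [80, 80, 90, 90, 100],
    'price':  [5.0, 69.0, 12.0, None, 80.0],
})

per_rating_max = toy.groupby('points').price.max()

# The maximum of the group maxima equals the maximum of the whole column
overall_from_groups = per_rating_max.max()
overall_direct = toy.price.max()
print(overall_from_groups, overall_direct)

# idxmax() returns the group label (here, the rating) where that maximum occurs
print(per_rating_max.idxmax())
```

On the wine data, `dat1.price.max()` would therefore be a shorter route to the same number.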

### What is the lowest price for a wine rating of 100?

In [34]:
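# Select the 100-point row directly; min_price.price.max() gives the same value only by coincidence
min_price.loc[min_price['points'] == 100, 'price'].iloc[0]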

80.0

In [35]:
min_price[min_price['points'] == 100]

Unnamed: 0,points,price,ratio
20,100,80.0,0.8


### What is the highest price for a wine rating of 80? 

In [36]:
max_price[max_price['points'] == 80]

Unnamed: 0,points,price,ratio
0,80,69.0,0.8625


### What is the maximum rating for each country? 

In [49]:
max_points = dat1.groupby('country').points.max()
max_points

max_points = pd.DataFrame(max_points)
max_points

country,points
Argentina,97
Armenia,88
Australia,100
Austria,98
Bosnia and Herzegovina,88
Brazil,89
Bulgaria,91
Canada,94
Chile,95
China,89


### What is the maximum rating for China?

In [43]:
# country is the index of max_points, so move it back to a column before filtering
max_points.reset_index().query("country == 'China'")

Unnamed: 0,country,points
9,China,89


In [50]:
max_points.loc['China']

points    89
Name: China, dtype: int64

In [44]:
# country is the index of max_points, so move it back to a column before filtering
max_points.reset_index().query("country == 'Argentina'")

Unnamed: 0,country,points
0,Argentina,97


##### Another way to get the maximum rating for China, combining `where` and `groupby`

In [53]:
points_by_country = dat1.groupby('country').points.max().reset_index()
points_by_country.where(points_by_country['country'] == 'China').dropna()

Unnamed: 0,country,points
9,China,89.0
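
Note that the `where` version shows `89.0` rather than `89`. The sketch below, on a hypothetical `toy` DataFrame with made-up values, illustrates why: `where()` fills non-matching rows with `NaN`, which turns the integer column into floats, while plain boolean indexing leaves the data types alone.

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({'country': ['Chile', 'China', 'US'], 'points': [95, 89, 100]})
is_china = toy['country'] == 'China'

# where() keeps the frame's shape and fills non-matching rows with NaN,
# which forces the integer column to float; dropna() then removes those rows
via_where = toy.where(is_china).dropna()

# Boolean indexing drops non-matching rows directly, so dtypes are unchanged
via_mask = toy[is_china]

print(via_where)
print(via_mask)
```

Both approaches return the same row; boolean indexing is usually the simpler choice when the goal is just to filter.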


### What are some summary stats for price for each country? 
>- Using the `agg()` function for specific summary stats
>>- What is the sample size?
>>- What is the minimum?
>>- What is the maximum?
>>- What is the mean?
>>- What is the median?
>>- What is the standard deviation? 

In [55]:
country_stats = dat1.groupby('country').price.agg(['count','min','max','mean','median','std'])
country_stats

country,count,min,max,mean,median,std
Argentina,3756,4.0,230.0,24.510117,17.0,23.430122
Armenia,2,14.0,15.0,14.5,14.5,0.707107
Australia,2294,5.0,850.0,35.437663,21.0,49.049458
Austria,2799,7.0,1100.0,30.762772,25.0,27.224797
Bosnia and Herzegovina,2,12.0,13.0,12.5,12.5,0.707107
Brazil,47,10.0,60.0,23.765957,20.0,11.053649
Bulgaria,141,8.0,100.0,14.64539,13.0,9.508744
Canada,254,12.0,120.0,35.712598,30.0,19.658148
Chile,4416,5.0,400.0,20.786458,15.0,21.929371
China,1,18.0,18.0,18.0,18.0,
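
Two details of the table above are worth a closer look: `count` only counts non-missing prices, and `std` is blank for China because a sample standard deviation cannot be computed from a single value. The sketch below reproduces both effects on a hypothetical `toy` DataFrame (the Brazil values are made up, and `sizes` is a name chosen for illustration).

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({
    'country': ['China', 'Armenia', 'Armenia', 'Brazil', 'Brazil'],
    'price':   [18.0, 14.0, 15.0, 10.0, None],
})

stats = toy.groupby('country').price.agg(['count', 'min', 'max', 'mean', 'median', 'std'])
print(stats)

# size() counts rows per group, including rows with a missing price
sizes = toy.groupby('country').size()
print(sizes)
```

If the number of reviews matters more than the number of priced reviews, `size()` is the safer sample-size figure.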


## What are the descriptive analytics for country and province? 
>- We can group by multiple fields by adding more to our groupby() function

In [57]:
drill_down = dat1.groupby(['country','province']).price.agg(['count','min','max','mean','median','std'])
drill_down

country,province,count,min,max,mean,median,std
Argentina,Mendoza Province,3226,4.0,230.0,25.053317,17.0,24.044538
Argentina,Other,530,7.0,150.0,21.203774,15.0,18.958615
Armenia,Armenia,2,14.0,15.0,14.500000,14.5,0.707107
Australia,Australia Other,236,5.0,130.0,12.427966,11.0,9.548482
Australia,New South Wales,85,8.0,125.0,25.623529,19.0,21.072424
...,...,...,...,...,...,...,...
Uruguay,Juanico,12,10.0,130.0,48.583333,45.0,40.692770
Uruguay,Montevideo,11,17.0,60.0,26.090909,22.0,12.864327
Uruguay,Progreso,11,12.0,46.0,24.272727,20.0,13.184012
Uruguay,San Jose,3,20.0,50.0,30.000000,20.0,17.320508
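
Grouping by two fields produces a two-level row index (a `MultiIndex`). As a minimal sketch, assuming a hypothetical `toy` DataFrame with made-up prices, rows can be pulled out of such a result with `.loc` without resetting the index:

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({
    'country':  ['US', 'US', 'US', 'Argentina', 'Argentina'],
    'province': ['Colorado', 'Colorado', 'Oregon', 'Other', 'Mendoza Province'],
    'price':    [12.0, 30.0, 14.0, 7.0, 4.0],
})

by_region = toy.groupby(['country', 'province']).price.agg(['count', 'mean'])

# Outer level only: every province of one country
us_only = by_region.loc['US']
print(us_only)

# Full key as a tuple: one (country, province) row
colorado = by_region.loc[('US', 'Colorado')]
print(colorado)
```

Calling `reset_index()` and filtering on the columns is an equally valid route; `.loc` just avoids the extra step.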


## What are the descriptive price analytics for the US?
>- Add `get_group()` syntax

In [66]:
testing = dat1.groupby(['country','province']).price.agg(['count','min','max','mean','median','std']).reset_index()
testing[testing['province'] == "Colorado"]

Unnamed: 0,country,province,count,min,max,mean,median,std
393,US,Colorado,68,12.0,100.0,32.985294,30.0,16.543468


In [67]:
testing[testing['country'] == 'Argentina']

Unnamed: 0,country,province,count,min,max,mean,median,std
0,Argentina,Mendoza Province,3226,4.0,230.0,25.053317,17.0,24.044538
1,Argentina,Other,530,7.0,150.0,21.203774,15.0,18.958615
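
For the `get_group()` syntax mentioned above, a minimal sketch on a hypothetical `toy` DataFrame (made-up values) might look like this. `get_group()` returns the original rows for one group, which can then be summarized like any other DataFrame.

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({
    'country': ['US', 'US', 'Italy', 'US'],
    'price':   [14.0, 13.0, None, 65.0],
})

# get_group() returns the original rows belonging to one group
us_rows = toy.groupby('country').get_group('US')

# Summary statistics for that single group
us_stats = us_rows.price.agg(['count', 'min', 'max', 'mean', 'median', 'std'])
print(us_stats)
```

On the wine data, the equivalent starting point would be `dat1.groupby('country').get_group('US')`.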


## What are the summary wine price stats for Colorado?
>- Note that states are coded in this dataset under province

In [68]:
testing[testing['province'] == 'Colorado']

Unnamed: 0,country,province,count,min,max,mean,median,std
393,US,Colorado,68,12.0,100.0,32.985294,30.0,16.543468


# Sorting Results
>- Add `sort_values()` syntax
>- Default is ascending order

## What are the summary stats for points for each country?
>- Sort the results from lowest to highest mean points

In [71]:
dat1.groupby('country').points.mean().sort_values(ascending = True)

country
Peru                      83.562500
Egypt                     84.000000
Ukraine                   84.071429
Brazil                    84.673077
Mexico                    85.257143
Romania                   86.400000
Chile                     86.493515
Bosnia and Herzegovina    86.500000
Argentina                 86.710263
Uruguay                   86.752294
Macedonia                 86.833333
Slovakia                  87.000000
Cyprus                    87.181818
Moldova                   87.203390
Croatia                   87.219178
Czech Republic            87.250000
Greece                    87.283262
Spain                     87.288337
Serbia                    87.500000
Armenia                   87.500000
Lebanon                   87.685714
Georgia                   87.686047
Bulgaria                  87.936170
South Africa              88.056388
Slovenia                  88.068966
Turkey                    88.088889
Portugal                  88.250220
New Zealand         

### To sort in descending order...
>- Use `ascending=False`

In [72]:
dat1.groupby('country').points.mean().sort_values(ascending = False)

country
England                   91.581081
India                     90.222222
Austria                   90.101345
Germany                   89.851732
Canada                    89.369650
Hungary                   89.191781
China                     89.000000
France                    88.845109
Luxembourg                88.666667
Australia                 88.580507
Switzerland               88.571429
Morocco                   88.571429
US                        88.563720
Italy                     88.562231
Israel                    88.471287
New Zealand               88.303030
Portugal                  88.250220
Turkey                    88.088889
Slovenia                  88.068966
South Africa              88.056388
Bulgaria                  87.936170
Georgia                   87.686047
Lebanon                   87.685714
Armenia                   87.500000
Serbia                    87.500000
Spain                     87.288337
Greece                    87.283262
Czech Republic      
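
The sorted output contains ties (for example, Armenia and Serbia both average 87.5). As a sketch of one way to control how ties are ordered, the example below sorts a hypothetical `toy` DataFrame (made-up values) by more than one column. On a DataFrame, `sort_values()` needs `by=` to name the column or columns, and `ascending` can take one boolean per column.

```python
import pandas as pd

# Hypothetical toy data (not the wine reviews)
toy = pd.DataFrame({
    'country': ['Serbia', 'Armenia', 'Peru', 'England'],
    'points':  [87.5, 87.5, 83.5, 91.5],
})

# Series: sort_values() sorts by the values themselves
means = toy.set_index('country').points
top = means.sort_values(ascending=False)
print(top.head(2))

# DataFrame: name the column(s) with by=; a list of booleans sets the
# direction per column, so ties on points are broken alphabetically
ranked = toy.sort_values(by=['points', 'country'], ascending=[False, True])
print(ranked)
```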