# Agenda, week 3: Real-world data

1. Recap and Q&A
2. (More about) CSV files
    - Selecting columns
    - Selecting the index
    - Header lines
3. Reading online data
    - CSV
    - Scraping sites with Pandas
4. Sorting data
    - Sorting by value
    - Sorting by index
    - Sorting by multiple values
5. Grouping
    - What is grouping?
    - Aggregate functions
    - Grouping by multiple columns
6. Pivot tables
7. Joining
    - What is joining?
    - Simple joins across data frames
8. Cleaning data    

Please download the data file mentioned on the course page. Warning: It's a big file, and it contains some very large CSV files!



In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series('1 10 20 50 89'.split())
s

0     1
1    10
2    20
3    50
4    89
dtype: object

In [3]:
s = Series('1 10 20 50 89'.split(), dtype=np.int64)   # this asks for int64 at creation time; with string items, it raised the ValueError shown below
s

  return bool(asarray(a1 == a2).all())


ValueError: values cannot be losslessly cast to int64

In [5]:
s = Series('1 10 20 50 89'.split())

In [6]:
s


0     1
1    10
2    20
3    50
4    89
dtype: object

In [7]:
s.astype(np.int64)

0     1
1    10
2    20
3    50
4    89
dtype: int64

In [12]:
s.astype(pd.StringDtype())

0     1
1    10
2    20
3    50
4    89
dtype: string

In [13]:
# is Python's None equal to itself?
None == None

True

In [14]:
# what type is Python's None?
type(None)

NoneType

In [15]:
# is NumPy's NaN equal to itself?
np.nan == np.nan

False

In [16]:
# what type is it?
type(np.nan)

float

In [17]:
s = Series([10, 20, 30, np.nan, 50, 60])

In [18]:
s

0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
5    60.0
dtype: float64

In [19]:
s = Series([10, 20, 30, None, 50, 60])
s

0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
5    60.0
dtype: float64

In [20]:
# there is a growing interest in using a special Pandas version of NaN, called pd.NA

s = Series([10, 20, 30, pd.NA, 50, 60])

In [21]:
s

0      10
1      20
2      30
3    <NA>
4      50
5      60
dtype: object

In [22]:
s.sum()

170
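
The series above ended up with dtype `object`, because `pd.NA` doesn't fit into a regular integer array. As a minimal sketch (not something we ran in class), asking for Pandas' nullable `Int64` dtype keeps the values as integers while still marking the missing one as `<NA>`. The variable name `nullable` is just for illustration.

```python
import pandas as pd

# Same values as above, but with the nullable integer dtype (note the capital "I")
nullable = pd.Series([10, 20, 30, pd.NA, 50, 60], dtype='Int64')

print(nullable.dtype)          # the dtype stays integer-based, not object or float64
print(nullable.isna().sum())   # how many values are missing
print(nullable.sum())          # missing values are skipped when summing
```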

In [23]:
df = DataFrame(np.random.randint(0, 100, [4, 5]),
              index=list('abcd'),
              columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,28,66,9,33,99
b,66,92,99,87,39
c,3,92,37,34,84
d,38,8,44,3,87


In [24]:
df['wx'] = df['w'] + df['x']

In [25]:
df

Unnamed: 0,v,w,x,y,z,wx
a,28,66,9,33,99,75
b,66,92,99,87,39,191
c,3,92,37,34,84,129
d,38,8,44,3,87,52


# More about CSV

CSV is a standard, but with a *lot* of leeway in its interpretation. So when you have a CSV file, it might (or might not) have a line at the top naming the columns. It might (or might not) use commas to separate the values. It might (or might not) contain special types of data.
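
To make that leeway concrete, here is a small sketch that reads made-up data from a string rather than a file. The contents of `raw` and the column names are invented for illustration; the point is only that `sep`, `header`, and `names` let us tell `read_csv` how to interpret a file that doesn't follow the usual conventions.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolons instead of commas, and no header line
raw = "apple;3\npear;5\n"

fruit = pd.read_csv(io.StringIO(raw),
                    sep=';',                    # values are separated by semicolons
                    header=None,                # the first line is data, not column names
                    names=['fruit', 'count'])   # so we provide the column names ourselves
print(fruit)
```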

In [26]:
help(pd.read_csv)   # show me the documentation for read_csv

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols=None, squeeze: 'bool | None' = None, prefix: 'str | lib.NoDefault' = <no_default>, mangle_dupe_cols: 'bool' = True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace: 'bool' = False, skiprows=None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, skip_blank_lines: 'bool' = True, parse_dates=None, infer_datetime_format: 'bool' = False, keep_date_col: 'bool' = F

In [27]:
# wine mag 150k reviews 

filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename)

In [28]:
# the first thing that I normally do when reading a CSV file is df.head()
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [29]:
# I also want to know: How big is this data frame?

df.shape

(150930, 11)

In [31]:
# How can I be more selective with my CSV file?

# What columns can I ignore from this file?
# - Unnamed: 0
# - designation
# - points
# - winery

# I can specify "usecols", and a list of either column names *or* column numbers, starting with 0

filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename,
                usecols=['country', 'description', 'price', 'province', 'region_1', 'region_2', 'variety'])
df.head()

Unnamed: 0,country,description,price,province,region_1,region_2,variety
0,US,This tremendous 100% varietal wine hails from ...,235.0,California,Napa Valley,Napa,Cabernet Sauvignon
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",110.0,Northern Spain,Toro,,Tinta de Toro
2,US,Mac Watson honors the memory of a wine once ma...,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc
3,US,"This spent 20 months in 30% new French oak, an...",65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir
4,France,"This is the top wine from La Bégude, named aft...",66.0,Provence,Bandol,,Provence red blend
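
As a quick check of the "names *or* numbers" point, this sketch reads a tiny inline sample (two rows with a handful of the wine columns, loaded from a string rather than from the real file) and selects the same columns both ways. The variable names are made up for the example.

```python
import io
import pandas as pd

# A tiny stand-in for the wine file, with only four columns
raw = """country,points,price,variety
US,96,235.0,Cabernet Sauvignon
Spain,96,110.0,Tinta de Toro
"""

by_name = pd.read_csv(io.StringIO(raw), usecols=['country', 'price'])
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 2])   # positions start at 0

print(by_name.equals(by_position))   # both approaches give the same data frame
```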


In [32]:
# if every row is 10 kb
# if we have 150k rows
# then the total size, in bytes, is:

150_000 * 10_000

1500000000

In [33]:
# cut down each row to be 8kb
150_000 * 8_000

1200000000
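
The arithmetic above is a back-of-the-envelope estimate. As a rough sketch on a tiny made-up data frame, `memory_usage(deep=True)` reports how many bytes each column really uses, so we can compare the full data frame against a version with fewer columns. The data and variable names here are invented for illustration.

```python
import pandas as pd

# A tiny, made-up data frame standing in for the wine data
df = pd.DataFrame({'country': ['US', 'Spain', 'US'],
                   'price': [235.0, 110.0, 90.0],
                   'points': [96, 96, 96]})

# deep=True also counts the memory used by the strings themselves
full_bytes = df.memory_usage(deep=True).sum()
slim_bytes = df[['country', 'price']].memory_usage(deep=True).sum()

print(full_bytes, slim_bytes)   # fewer columns means fewer bytes
```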

In [34]:
# let's make the index the country column

df = pd.read_csv(filename,
                usecols=['country', 'description', 'price', 'province', 'region_1', 'region_2', 'variety'])
df = df.set_index('country')
df.head()

Unnamed: 0_level_0,description,price,province,region_1,region_2,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
US,This tremendous 100% varietal wine hails from ...,235.0,California,Napa Valley,Napa,Cabernet Sauvignon
Spain,"Ripe aromas of fig, blackberry and cassis are ...",110.0,Northern Spain,Toro,,Tinta de Toro
US,Mac Watson honors the memory of a wine once ma...,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc
US,"This spent 20 months in 30% new French oak, an...",65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir
France,"This is the top wine from La Bégude, named aft...",66.0,Provence,Bandol,,Provence red blend


In [36]:
# we can do this in one step, when reading the CSV file in from disk

df = pd.read_csv(filename,
                usecols=['country', 'description', 'price', 'province', 'region_1', 'region_2', 'variety'],
                index_col='country')
df.head()

Unnamed: 0_level_0,description,price,province,region_1,region_2,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
US,This tremendous 100% varietal wine hails from ...,235.0,California,Napa Valley,Napa,Cabernet Sauvignon
Spain,"Ripe aromas of fig, blackberry and cassis are ...",110.0,Northern Spain,Toro,,Tinta de Toro
US,Mac Watson honors the memory of a wine once ma...,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc
US,"This spent 20 months in 30% new French oak, an...",65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir
France,"This is the top wine from La Bégude, named aft...",66.0,Provence,Bandol,,Provence red blend


In [39]:
df.loc['Albania']

Unnamed: 0_level_0,description,price,province,region_1,region_2,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Albania,This garnet-colored wine made from 100% Kallme...,20.0,Mirditë,,,Kallmet
Albania,This garnet-colored wine made from 100% Kallme...,20.0,Mirditë,,,Kallmet


In [40]:
# I sync to GitHub with "gitautopush"
# Just search for it on PyPI, and install it

# Exercise: Wine reviews

1. Read the winemag-150k-reviews data set into a data frame. We only care about country, price, and variety.
2. Which wine in this data set has the highest price?
3. Which country has the most wines in this data set?
4. What is the average price of Cabernet Sauvignon? How about Malbec?

In [41]:
filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename, 
                usecols=['country', 'price', 'variety'])
df.head()

Unnamed: 0,country,price,variety
0,US,235.0,Cabernet Sauvignon
1,Spain,110.0,Tinta de Toro
2,US,90.0,Sauvignon Blanc
3,US,65.0,Pinot Noir
4,France,66.0,Provence red blend


In [42]:
df.dtypes

country     object
price      float64
variety     object
dtype: object

In [45]:
# which wine has the highest price?
df.loc[df['price'] == df['price'].max()]

Unnamed: 0,country,price,variety
34920,France,2300.0,Bordeaux-style Red Blend


In [48]:
# which country has the most wines in this data set?
df['country'].value_counts().head(1)

US    62397
Name: country, dtype: int64

In [51]:
# What is the average price of Cabernet Sauvignon? 

# row selector: variety has to be CS
# column selector: price

df.loc[
    df['variety'] == 'Cabernet Sauvignon'   # row selector
    ,
    'price'  # column selector
].mean()

42.146634046247335

In [52]:
# How about Malbec?


df.loc[
    df['variety'] == 'Malbec'   # row selector
    ,
    'price'  # column selector
].mean()

25.631118314424636

Data files are here: https://files.lerner.co.il/pandas-workout-data.zip

# Let's look again at `pd.read_csv`

In [53]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols=None, squeeze: 'bool | None' = None, prefix: 'str | lib.NoDefault' = <no_default>, mangle_dupe_cols: 'bool' = True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace: 'bool' = False, skiprows=None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, skip_blank_lines: 'bool' = True, parse_dates=None, infer_datetime_format: 'bool' = False, keep_date_col: 'bool' = F

In [54]:
url = 'https://gist.githubusercontent.com/reuven/bb116ba2034bb10bb7e4e2caa5d8a000/raw/3660c4af808684dbf17af48b3d2f25b6a218535f/CSCO.csv'

In [55]:
pd.read_csv(url)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2017-12-22,38.52,38.740002,38.470001,38.549999,38.264591,11441600
1,2017-12-26,38.549999,38.68,38.360001,38.48,38.19511,8186100
2,2017-12-27,38.540001,38.650002,38.450001,38.560001,38.274517,10543000
3,2017-12-28,38.73,38.73,38.450001,38.59,38.304295,8807700
4,2017-12-29,38.41,38.619999,38.299999,38.299999,38.016441,12583600
5,2018-01-02,38.669998,38.950001,38.43,38.860001,38.572296,20135700
6,2018-01-03,38.720001,39.279999,38.529999,39.169998,38.879997,29536000
7,2018-01-04,39.049999,39.540001,38.93,38.990002,38.990002,20731400
8,2018-01-05,39.549999,39.880001,39.369999,39.529999,39.529999,24588200
9,2018-01-08,39.52,39.959999,39.349998,39.939999,39.939999,16582000


In [56]:
# Latest Bitcoin prices are at https://api.blockchain.info/charts/market-price?format=csv

In [57]:
url = 'https://api.blockchain.info/charts/market-price?format=csv'

df = pd.read_csv(url)
df.head()

Unnamed: 0,2021-11-29 00:00:00,57292.28
0,2021-11-30 00:00:00,57828.45
1,2021-12-01 00:00:00,57025.79
2,2021-12-02 00:00:00,57229.76
3,2021-12-03 00:00:00,56508.48
4,2021-12-04 00:00:00,53713.84


In [58]:
df.tail()

Unnamed: 0,2021-11-29 00:00:00,57292.28
360,2022-11-25 00:00:00,16592.67
361,2022-11-26 00:00:00,16507.44
362,2022-11-27 00:00:00,16453.47
363,2022-11-28 00:00:00,16420.2
364,2022-11-29 00:00:00,16208.96


In [61]:
df = pd.read_csv(url,
                 header=None,
                 names=['date', 'bitcoin'],
                 index_col='date')
df.head()


Unnamed: 0_level_0,bitcoin
date,Unnamed: 1_level_1
2021-11-29 00:00:00,57292.28
2021-11-30 00:00:00,57828.45
2021-12-01 00:00:00,57025.79
2021-12-02 00:00:00,57229.76
2021-12-03 00:00:00,56508.48


In [62]:
df.loc['2022-07-14 00:00:00']

bitcoin    20223.69
Name: 2022-07-14 00:00:00, dtype: float64

In [64]:
# let's say that I like the GDP data in Wikipedia
# I'd like to read it into Pandas, into a data frame

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

# read_html retrieves every HTML table on that web page into a data frame
# you get back a list of data frames, one per HTML table
all_dfs = pd.read_html(url)

In [65]:
all_dfs[3]

Unnamed: 0_level_0,Country/Territory,UN Region,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,UN Region,Estimate,Year,Estimate,Year,Estimate,Year
0,World,—,101560901,2022,96100091,2021,85328323,2020
1,United States,Americas,25035164,2022,22996100,2021,20893746,2020
2,China,Asia,18321197,[n 1]2022,17734063,[n 3]2021,14722801,[n 1]2020
3,European Union[n 4],Europe,16613060,2022,17088621,2021,15292201,[16]2020
4,Japan,Asia,4300621,2022,4937422,2021,5057759,2020
...,...,...,...,...,...,...,...,...
213,Palau,Oceania,226,2022,258,2020,264,2020
214,Kiribati,Oceania,207,2022,181,2020,181,2020
215,Nauru,Oceania,134,2022,133,2021,135,2020
216,Montserrat,Americas,—,—,—,—,68,2020


In [67]:
df = all_dfs[3]
df.columns=['country', 'un_region', 'imf_estimate', 'imf_year', 'wb_estimate', 'wb_year', 'un_estimate', 'un_year']

In [68]:
df

Unnamed: 0,country,un_region,imf_estimate,imf_year,wb_estimate,wb_year,un_estimate,un_year
0,World,—,101560901,2022,96100091,2021,85328323,2020
1,United States,Americas,25035164,2022,22996100,2021,20893746,2020
2,China,Asia,18321197,[n 1]2022,17734063,[n 3]2021,14722801,[n 1]2020
3,European Union[n 4],Europe,16613060,2022,17088621,2021,15292201,[16]2020
4,Japan,Asia,4300621,2022,4937422,2021,5057759,2020
...,...,...,...,...,...,...,...,...
213,Palau,Oceania,226,2022,258,2020,264,2020
214,Kiribati,Oceania,207,2022,181,2020,181,2020
215,Nauru,Oceania,134,2022,133,2021,135,2020
216,Montserrat,Americas,—,—,—,—,68,2020


# Exercise: Blockchain downloads

Bitcoin info: 'https://api.blockchain.info/charts/market-price?format=csv'

1. Create a data frame from the Bitcoin info, in which the date is the index.
2. On which date was Bitcoin at its highest value?
3. On which date was it at its lowest value? (The info only goes back one year, I believe.)

In [69]:
url = 'https://api.blockchain.info/charts/market-price?format=csv'

df = pd.read_csv(url,
                 header=None,
                names=['date', 'btc'],
                index_col='date')
df.head()

Unnamed: 0_level_0,btc
date,Unnamed: 1_level_1
2021-11-29 00:00:00,57292.28
2021-11-30 00:00:00,57828.45
2021-12-01 00:00:00,57025.79
2021-12-02 00:00:00,57229.76
2021-12-03 00:00:00,56508.48


In [72]:
# on which date was bitcoin at its highest value?
df.loc[df['btc'] == df['btc'].max()]

Unnamed: 0_level_0,btc
date,Unnamed: 1_level_1
2021-11-30 00:00:00,57828.45


In [73]:
# On which date was it at its lowest value?

df.loc[df['btc'] == df['btc'].min()]

Unnamed: 0_level_0,btc
date,Unnamed: 1_level_1
2022-11-22 00:00:00,15759.61


# Good sources for interesting datasets

1. https://www.kaggle.com/datasets
2. https://github.com/awesomedata/awesome-public-datasets

# Next up

1. Sorting
2. Basic grouping

# Sorting

It's common for us to want to sort our data. If I just want to pick out the highest or the lowest value, I can do that with a boolean index, comparing against `max()` or `min()`. But if I want the 10 largest values, then sorting is going to be more useful. Sorting is also useful if I'm going to *look* at the data.
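
Here is a minimal sketch of the "top N" idea on a tiny made-up data frame. It sorts in descending order and takes the first rows; it also shows `nlargest`, a Pandas shortcut for the same result that we don't otherwise use in these notes. The data is invented for illustration.

```python
import pandas as pd

# Made-up prices, including one missing value
df = pd.DataFrame({'variety': ['A', 'B', 'C', 'D'],
                   'price': [20.0, None, 65.0, 12.0]})

# Sort from highest to lowest, then keep the first two rows
top_sorted = df.sort_values('price', ascending=False).head(2)

# nlargest gives the rows with the two largest prices directly
top_nlargest = df.nlargest(2, 'price')

print(top_sorted)
print(top_nlargest)
```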

In [74]:
filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename)

In [75]:
# What are the 10 most expensive wines in this database?
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [77]:
# I want to sort the rows of df by the "price" column

df.sort_values('price')  

# (1) The rows are sorted in ascending order (from lowest to highest)
# (2) We haven't modified df -- rather, we got a new data frame back from .sort_values

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
90546,90546,Argentina,Clean as anyone should reasonably expect given...,,85,4.0,Mendoza Province,Mendoza,,Malbec,Toca Diamonte
25645,25645,US,"There's a lot going on in this Merlot, which i...",,86,4.0,California,California,California Other,Merlot,Bandit
118347,118347,US,"Light and earthy, this wine-in-a-box is clean ...",,84,4.0,California,California,California Other,Cabernet Sauvignon,Bandit
1858,1858,US,"Sweet and fruity, this canned wine feels soft ...",Unoaked,83,4.0,California,California,California Other,Chardonnay,Pam's Cuties
91766,91766,Argentina,"Crimson in color but also translucent, with a ...",Red,84,4.0,Mendoza Province,Mendoza,,Malbec-Syrah,Broke Ass
...,...,...,...,...,...,...,...,...,...,...,...
150377,150377,New Zealand,"Light and a bit herbal, like a pleasant St.-Jo...",Matheson,84,,Hawke's Bay,,,Syrah,Matua Valley
150378,150378,New Zealand,"Impressive purple color, but less intense on t...",,84,,Martinborough,,,Syrah,Kusuda
150587,150587,Canada,"Shows pronounced oily, earthy, almost tobacco-...",Icewine,90,,Ontario,Lake Erie North Shore,,Riesling,Colio
150673,150673,US,"Cherry-scented, clean and fruity. Good concent...",,87,,California,Dry Creek Valley,Sonoma,Zinfandel,Taft Street


In [78]:
# since we're sorting by price, in ascending order, the 10 most expensive wines should be
# the 10 final rows -- but NaN prices are sorted last by default, so they show up here instead

df.sort_values('price').tail(10)


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
150255,150255,New Zealand,"Starts with scents of anise and blackberry, th...",,85,,Hawke's Bay,,,Syrah,Vidal
150260,150260,France,"Always reliable, Hanteillan has reflected the ...",,85,,Bordeaux,Haut-Médoc,,Bordeaux-style Red Blend,Château Hanteillan
150261,150261,New Zealand,"A bit heavy for Riesling, with pretty pear and...",,85,,Waipara,,,Riesling,Daniel Schuster
150319,150319,New Zealand,"A bit jammy, with aromas and flavors of slight...",Innovator Bullrush,85,,Hawke's Bay,,,Syrah,Matua Valley
150322,150322,New Zealand,"Impressively dark in color, but shows more woo...",Reserve,84,,Hawke's Bay,,,Syrah,CJ Pask
150377,150377,New Zealand,"Light and a bit herbal, like a pleasant St.-Jo...",Matheson,84,,Hawke's Bay,,,Syrah,Matua Valley
150378,150378,New Zealand,"Impressive purple color, but less intense on t...",,84,,Martinborough,,,Syrah,Kusuda
150587,150587,Canada,"Shows pronounced oily, earthy, almost tobacco-...",Icewine,90,,Ontario,Lake Erie North Shore,,Riesling,Colio
150673,150673,US,"Cherry-scented, clean and fruity. Good concent...",,87,,California,Dry Creek Valley,Sonoma,Zinfandel,Taft Street
150922,150922,Italy,Made by 30-ish Roberta Borghese high above Man...,Superiore,91,,Northeastern Italy,Colli Orientali del Friuli,,Tocai,Ronchi di Manzano


In [79]:
filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename,
                usecols=['country', 'variety', 'price'])

In [83]:
# Here, I:

# (1) remove rows containing NaN
# (2) sort the remaining rows by price, in ascending order
# (3) look at the final 10 rows of what remains

df.dropna().sort_values('price').tail(10)

Unnamed: 0,country,price,variety
34927,France,1100.0,Bordeaux-style Red Blend
10651,Austria,1100.0,Grüner Veltliner
34942,France,1200.0,Bordeaux-style Red Blend
34939,France,1300.0,Bordeaux-style Red Blend
83536,France,1400.0,Chardonnay
26296,France,1400.0,Chardonnay
51886,France,1400.0,Chardonnay
34922,France,1900.0,Bordeaux-style Red Blend
13318,US,2013.0,Chardonnay
34920,France,2300.0,Bordeaux-style Red Blend


In [84]:
# let's look at sort_values, and see what options it gives
help(df.sort_values)

Help on method sort_values in module pandas.core.frame:

sort_values(by: 'IndexLabel', *, axis: 'Axis' = 0, ascending: 'bool | list[bool] | tuple[bool, ...]' = True, inplace: 'bool' = False, kind: 'str' = 'quicksort', na_position: 'str' = 'last', ignore_index: 'bool' = False, key: 'ValueKeyFunc' = None) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Sort by the values along either axis.
    
    Parameters
    ----------
            by : str or list of str
                Name or list of names to sort by.
    
                - if `axis` is 0 or `'index'` then `by` may contain index
                  levels and/or column labels.
                - if `axis` is 1 or `'columns'` then `by` may contain column
                  levels and/or index labels.
    axis : {0 or 'index', 1 or 'columns'}, default 0
         Axis to be sorted.
    ascending : bool or list of bool, default True
         Sort ascending vs. descending. Specify list for multiple sort
         or

In [86]:
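# na_position='first' puts the NaN prices at the top, so the final 10 rows hold the 10 highest prices
df.sort_values('price', na_position='first').tail(10)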

Unnamed: 0,country,price,variety
10651,Austria,1100.0,Grüner Veltliner
34927,France,1100.0,Bordeaux-style Red Blend
34942,France,1200.0,Bordeaux-style Red Blend
34939,France,1300.0,Bordeaux-style Red Blend
26296,France,1400.0,Chardonnay
83536,France,1400.0,Chardonnay
51886,France,1400.0,Chardonnay
34922,France,1900.0,Bordeaux-style Red Blend
13318,US,2013.0,Chardonnay
34920,France,2300.0,Bordeaux-style Red Blend


# `inplace`

Pandas methods almost always return a new data frame, rather than modifying the data frame itself. This might seem wasteful (in terms of memory and performance), but the Pandas core developers assure us that this is not the case.

You can, if you want, pass `inplace=True` to a very large number of Pandas methods. If you do that, then the method will return `None`, and you'll modify the data frame itself, in place.

However, the Pandas core developers are planning to remove `inplace` from most (or all) methods, and basically beg us not to use it.
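
The difference is easy to see on a tiny made-up data frame. In this sketch, the first call returns a new, sorted data frame and leaves the original alone; the second call, with `inplace=True`, returns `None` and changes the original. The variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'price': [30, 10, 20]})

# The usual way: we get back a new data frame, and df is untouched
sorted_df = df.sort_values('price')
before = df['price'].tolist()

# With inplace=True, the method returns None and df itself is modified
result = df.sort_values('price', inplace=True)
after = df['price'].tolist()

print(before, after, result)
```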

In [88]:
# sometimes, I want to sort by the index, rather than by the column

df = df.set_index('country')
df

Unnamed: 0_level_0,price,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1
US,235.0,Cabernet Sauvignon
Spain,110.0,Tinta de Toro
US,90.0,Sauvignon Blanc
US,65.0,Pinot Noir
France,66.0,Provence red blend
...,...,...
Italy,20.0,White Blend
France,27.0,Champagne Blend
Italy,20.0,White Blend
France,52.0,Champagne Blend


In [92]:
# how can I sort my data frame, such that the index is ordered alphabetically?
# I can use df.sort_index(); passing ascending=False gives reverse alphabetical order

df.sort_index(ascending=False)

Unnamed: 0_level_0,price,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Uruguay,15.0,Viognier
Uruguay,30.0,Tannat
Uruguay,52.0,Red Blend
Uruguay,17.0,Tannat
Uruguay,10.0,Tannat
...,...,...
,17.0,Assyrtiko
,30.0,Red Blend
,15.0,Pinot Noir
,15.0,Pinot Noir


# Exercise: High and low temps

1. Create a data frame from the file `new+york,ny.csv`. This contains weather information over a 4-month period (December 2018 - March 2019) in New York City.
2. You only need to load a few columns: `date_time`, `new+york,ny_maxtempC`, `new+york,ny_mintempC`.
3. Rename the columns to be `date_time`, `max_temp`, and `min_temp`.
4. Set the `date_time` column to be the index.
5. Find the 5 lowest temperatures recorded in New York during this period.
6. Find the 5 highest temperatures recorded in New York during this period.


In [101]:
filename = '/Users/reuven/Courses/Current/data/new+york,ny.csv'

df = pd.read_csv(filename,
            usecols=[0, 1, 2],   # use the numbers when you're going to change the names
            names=['date_time', 'max_temp', 'min_temp'],
            header=0,
            index_col='date_time')

In [102]:
df.head()

Unnamed: 0_level_0,max_temp,min_temp
date_time,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-12-11 00:00:00,4,-1
2018-12-11 03:00:00,4,-1
2018-12-11 06:00:00,4,-1
2018-12-11 09:00:00,4,-1
2018-12-11 12:00:00,4,-1


In [104]:
df.sort_values('min_temp').head(5)

Unnamed: 0_level_0,max_temp,min_temp
date_time,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-31 09:00:00,-8,-14
2019-01-31 00:00:00,-8,-14
2019-01-31 12:00:00,-8,-14
2019-01-21 21:00:00,-12,-14
2019-01-31 06:00:00,-8,-14


In [106]:
df.sort_values('max_temp').tail(5)

Unnamed: 0_level_0,max_temp,min_temp
date_time,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-12-21 12:00:00,15,12
2018-12-21 09:00:00,15,12
2018-12-21 06:00:00,15,12
2018-12-21 03:00:00,15,12
2018-12-21 00:00:00,15,12


In [108]:
# sorting by more than one column
# sort by min_temp, and then (if there's a tie) by max_temp

# just provide a list of column names, rather than a single column name

df.sort_values(['min_temp', 'max_temp']).head(30)

Unnamed: 0_level_0,max_temp,min_temp
date_time,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-21 00:00:00,-12,-14
2019-01-21 03:00:00,-12,-14
2019-01-21 06:00:00,-12,-14
2019-01-21 09:00:00,-12,-14
2019-01-21 12:00:00,-12,-14
2019-01-21 15:00:00,-12,-14
2019-01-21 18:00:00,-12,-14
2019-01-21 21:00:00,-12,-14
2019-01-31 00:00:00,-8,-14
2019-01-31 03:00:00,-8,-14
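
When sorting by several columns, `ascending` can also be a list, with one boolean per column, as the `sort_values` documentation mentions. Here is a small sketch on made-up temperatures (the values and index labels are invented) that sorts by `min_temp` from lowest to highest, and breaks ties with `max_temp` in either direction.

```python
import pandas as pd

# Made-up temperatures; three rows share the same min_temp
df = pd.DataFrame({'min_temp': [-14, -14, -10, -14],
                   'max_temp': [-8, -12, -5, -9]},
                  index=['a', 'b', 'c', 'd'])

# Both columns in ascending order
both_up = df.sort_values(['min_temp', 'max_temp'])

# min_temp ascending, but ties broken by max_temp in descending order
mixed = df.sort_values(['min_temp', 'max_temp'], ascending=[True, False])

print(both_up)
print(mixed)
```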


# Grouping

If I read the wine data into Pandas, I can find out:

- What's the average price of wines from France?
- What's the average price of wines from the US?
- What's the average price of wines from Chile?

In [109]:
filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename,
                usecols=['country', 'price'])
df.head()

Unnamed: 0,country,price
0,US,235.0
1,Spain,110.0
2,US,90.0
3,US,65.0
4,France,66.0


In [113]:
# average price of wines in France
df.loc[df['country'] == 'France',  # row selector
      'price'].mean()              # column selector

45.61988501859993

In [114]:
df.loc[df['country'] == 'US',      # row selector
      'price'].mean()              # column selector

33.65380839730282

In [115]:
df.loc[df['country'] == 'Chile',      # row selector
      'price'].mean()              # column selector

19.344779743322928

At a certain point, this becomes tedious.

I'd rather ask Pandas to take every unique value in df['country'], and calculate the mean price of all wines in that country:

- Take each country mentioned in df['country']
- Create a mask index, to find only wines from that country
- Retrieve the value from the `price` column
- Take the mean of those values

That is grouping!
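
The steps above can be sketched by hand, and compared with what `groupby` gives us. This uses a tiny made-up data frame rather than the wine file, and the names `manual` and `grouped` are just for illustration.

```python
import pandas as pd

# Made-up data standing in for the wine reviews
df = pd.DataFrame({'country': ['US', 'France', 'US', 'Chile', 'France'],
                   'price': [90.0, 66.0, 65.0, 20.0, 45.0]})

# The manual version: one boolean mask and one mean per country
manual = {}
for country in df['country'].unique():
    mask = df['country'] == country                  # only rows from this country
    manual[country] = df.loc[mask, 'price'].mean()   # mean of their prices

# The grouping version: Pandas does the same work in one line
grouped = df.groupby('country')['price'].mean()

print(manual)
print(grouped)
```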

In [117]:
# what unique values? (country)
# on what column to calculate? (price)
# what method do we want to run? (mean)

# we get back a series in which the index contains the different values
# for "country", and the values are the mean prices for each country.

df.groupby('country')['price'].mean()

country
Albania                   20.000000
Argentina                 20.794881
Australia                 31.258480
Austria                   31.192106
Bosnia and Herzegovina    12.750000
Brazil                    19.920000
Bulgaria                  11.545455
Canada                    34.628866
Chile                     19.344780
China                     20.333333
Croatia                   23.108434
Cyprus                    15.483871
Czech Republic            18.000000
Egypt                           NaN
England                   47.500000
France                    45.619885
Georgia                   18.581395
Germany                   39.011078
Greece                    21.747706
Hungary                   44.204348
India                     13.875000
Israel                    31.304918
Italy                     37.547913
Japan                     24.000000
Lebanon                   25.432432
Lithuania                 10.000000
Luxembourg                40.666667
Macedonia           

# Examples of grouping

- Show total sales, grouped by region
- Show mean infection rates, grouped by country
- Show population, grouped by state
- Show mean SAT score, grouped by university

What methods can we use when grouping? Any aggregation method -- it takes many values, and returns a single value.

- `mean`
- `std`
- `sum`
- `count`
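
As a small sketch of those aggregation methods, we can ask for several of them at once by passing a list of names to `agg`. The data below is made up; notice that `count` only counts the non-missing values.

```python
import pandas as pd

# Made-up prices, with one missing value for France
df = pd.DataFrame({'country': ['US', 'US', 'France', 'France', 'France'],
                   'price': [10.0, 30.0, 20.0, 40.0, None]})

# One row per country, one column per aggregation method
summary = df.groupby('country')['price'].agg(['mean', 'std', 'sum', 'count'])
print(summary)
```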

# Exercise: Grouping wines

1. Read the wine data set into a data frame. Keep the country, price, region, and variety columns.
2. What country has the most expensive wines, on average?
3. What variety is most popular?
4. What region produces the cheapest wines?