# Agenda, week 3: Real-world data

1. Recap and Q&A
2. (More about) CSV files
    - Selecting columns
    - Selecting the index
    - Header lines
3. Reading online data
    - CSV
    - Scraping sites with Pandas
4. Sorting data
    - Sorting by value
    - Sorting by index
    - Sorting by multiple values
5. Grouping
    - What is grouping?
    - Aggregate functions
    - Grouping by multiple columns
6. Pivot tables
7. Joining
    - What is joining?
    - Simple joins across data frames
8. Cleaning data    

Please download the data file mentioned on the course page. Warning: it's a big file, and it contains some very large CSV files!



In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series('1 10 20 50 89'.split())
s

0     1
1    10
2    20
3    50
4    89
dtype: object

In [3]:
s = Series('1 10 20 50 89'.split(), dtype=np.int64)   # this tries to force the dtype to int64, but with strings as input it raises the ValueError shown below
s

  return bool(asarray(a1 == a2).all())


ValueError: values cannot be losslessly cast to int64

In [5]:
s = Series('1 10 20 50 89'.split())

In [6]:
s


0     1
1    10
2    20
3    50
4    89
dtype: object

In [7]:
s.astype(np.int64)

0     1
1    10
2    20
3    50
4    89
dtype: int64
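As an alternative to `astype`, `pd.to_numeric` can do the same conversion and has an option for values that are not numbers at all. The following is a minimal sketch; the `messy` series and its `'twenty'` entry are made up for illustration and are not part of the course data.

```python
import pandas as pd

# The same strings as in the cell above
s = pd.Series('1 10 20 50 89'.split())

# to_numeric picks an appropriate numeric dtype (int64 here)
as_ints = pd.to_numeric(s)

# Hypothetical messy input: one entry cannot be parsed as a number
messy = pd.Series(['1', '10', 'twenty', '50'])

# errors='coerce' turns unparseable entries into NaN instead of raising,
# which forces the result to float64
coerced = pd.to_numeric(messy, errors='coerce')

print(as_ints.dtype)
print(coerced)
```

With the default `errors='raise'`, the second call would raise a `ValueError` instead of producing a NaN.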

In [12]:
s.astype(pd.StringDtype())

0     1
1    10
2    20
3    50
4    89
dtype: string

In [13]:
# is Python's None equal to itself?
None == None

True

In [14]:
# what type is Python's None?
type(None)

NoneType

In [15]:
# is NumPy's NaN equal to itself?
np.nan == np.nan

False

In [16]:
# what type is it?
type(np.nan)

float
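Because NaN is not equal to itself, `==` cannot be used to find it. A small sketch of the usual workaround, using `np.isnan` and `pd.isna` (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

value = np.nan

# Equality never matches NaN, not even against itself
equal_to_itself = (value == value)

# Dedicated predicates are the reliable way to detect NaN
found_by_numpy = np.isnan(value)
found_by_pandas = pd.isna(value)

# pd.isna also treats None as missing;
# np.isnan(None) would raise a TypeError instead
none_is_missing = pd.isna(None)

print(equal_to_itself, found_by_numpy, found_by_pandas, none_is_missing)
```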

In [17]:
s = Series([10, 20, 30, np.nan, 50, 60])

In [18]:
s

0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
5    60.0
dtype: float64

In [19]:
s = Series([10, 20, 30, None, 50, 60])
s

0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
5    60.0
dtype: float64

In [20]:
# there is a growing interest in using a special Pandas version of NaN, called pd.NA

s = Series([10, 20, 30, pd.NA, 50, 60])

In [21]:
s

0      10
1      20
2      30
3    <NA>
4      50
5      60
dtype: object

In [22]:
s.sum()

170
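Notice that the series above ended up with dtype `object`. One way to keep integers together with `pd.NA` is to ask for the nullable `Int64` dtype (capital "I") explicitly. This is a sketch of that option, not something the session itself ran:

```python
import pandas as pd

# 'Int64' (capital I) is the nullable integer dtype in pandas;
# it can hold pd.NA without falling back to object or float64
s = pd.Series([10, 20, 30, pd.NA, 50, 60], dtype='Int64')

# Missing values are skipped by default when summing
total = s.sum()

# Count how many entries are missing
n_missing = s.isna().sum()

print(s.dtype, total, n_missing)
```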

In [23]:
df = DataFrame(np.random.randint(0, 100, [4, 5]),
              index=list('abcd'),
              columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,28,66,9,33,99
b,66,92,99,87,39
c,3,92,37,34,84
d,38,8,44,3,87


In [24]:
df['wx'] = df['w'] + df['x']

In [25]:
df

Unnamed: 0,v,w,x,y,z,wx
a,28,66,9,33,99,75
b,66,92,99,87,39,191
c,3,92,37,34,84,129
d,38,8,44,3,87,52


# More about CSV

CSV is a standard, but with a *lot* of leeway in its interpretation. So when you have a CSV file, it might (or might not) have a line at the top naming the columns. It might (or might not) use commas to separate the values. It might (or might not) contain special types of data.
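To make that leeway concrete, here is a small sketch that reads "CSV" text separated by semicolons, with a custom marker for missing data. The text in `raw` and the `'unknown'` marker are invented for this example; `sep` and `na_values` are regular `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolons instead of commas,
# and the word 'unknown' standing in for a missing price
raw = (
    "country;price;variety\n"
    "US;235.0;Cabernet Sauvignon\n"
    "Spain;unknown;Tinta de Toro\n"
)

# io.StringIO lets read_csv treat a string as if it were a file
df = pd.read_csv(
    io.StringIO(raw),
    sep=';',                 # the separator used in this text
    na_values=['unknown'],   # extra strings to treat as NaN
)

print(df)
print(df.dtypes)
```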

In [26]:
help(pd.read_csv)   # show me the documentation for read_csv

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols=None, squeeze: 'bool | None' = None, prefix: 'str | lib.NoDefault' = <no_default>, mangle_dupe_cols: 'bool' = True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace: 'bool' = False, skiprows=None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, skip_blank_lines: 'bool' = True, parse_dates=None, infer_datetime_format: 'bool' = False, keep_date_col: 'bool' = F

In [27]:
# wine mag 150k reviews 

filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename)

In [28]:
# the first thing that I normally do when reading a CSV file is df.head()
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [29]:
# I also want to know: How big is this data frame?

df.shape

(150930, 11)

In [31]:
# How can I be more selective with my CSV file?

# What columns can I ignore from this file?
# - Unnamed: 0
# - designation
# - points
# - winery

# I can specify "usecols", and a list of either column names *or* column numbers, starting with 0

filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename,
                usecols=['country', 'description', 'price', 'province', 'region_1', 'region_2', 'variety'])
df.head()

Unnamed: 0,country,description,price,province,region_1,region_2,variety
0,US,This tremendous 100% varietal wine hails from ...,235.0,California,Napa Valley,Napa,Cabernet Sauvignon
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",110.0,Northern Spain,Toro,,Tinta de Toro
2,US,Mac Watson honors the memory of a wine once ma...,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc
3,US,"This spent 20 months in 30% new French oak, an...",65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir
4,France,"This is the top wine from La Bégude, named aft...",66.0,Provence,Bandol,,Provence red blend
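The comment in the cell above mentions that `usecols` also accepts column numbers. A minimal sketch of both forms, using a tiny in-memory stand-in for the real file (the contents of `raw` are made up for this example):

```python
import io
import pandas as pd

# Hypothetical miniature version of the wine file
raw = (
    "country,points,price,variety\n"
    "US,96,235.0,Cabernet Sauvignon\n"
    "Spain,96,110.0,Tinta de Toro\n"
)

# Select columns by name...
by_name = pd.read_csv(io.StringIO(raw), usecols=['country', 'price'])

# ...or by position, counting from 0
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 2])

print(by_name)
print(by_position)
```

Both calls should give the same data frame; names tend to be easier to read, while numbers help when a file has no header line.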


In [32]:
# back-of-the-envelope estimate: if every row takes 10 KB
# and we have 150k rows, how many bytes do we need?

150_000 * 10_000

1500000000

In [33]:
# if dropping columns cuts each row down to 8 KB
150_000 * 8_000

1200000000
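Rather than estimating, we can also ask pandas how much memory a data frame uses. The sketch below uses a small made-up data frame, so the byte counts mean nothing in themselves; the point is only that dropping a wide text column shrinks the total.

```python
import pandas as pd

# Hypothetical data frame with one long text column
df = pd.DataFrame({
    'country': ['US', 'Spain', 'France'],
    'description': ['x' * 200] * 3,
    'price': [235.0, 110.0, 66.0],
})

# deep=True counts the actual strings, not just the pointers to them
full_bytes = df.memory_usage(deep=True).sum()

# The same measurement without the description column
slim_bytes = df[['country', 'price']].memory_usage(deep=True).sum()

print(full_bytes, slim_bytes)
```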

In [34]:
# let's make the index the country column

df = pd.read_csv(filename,
                usecols=['country', 'description', 'price', 'province', 'region_1', 'region_2', 'variety'])
df = df.set_index('country')
df.head()

Unnamed: 0_level_0,description,price,province,region_1,region_2,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
US,This tremendous 100% varietal wine hails from ...,235.0,California,Napa Valley,Napa,Cabernet Sauvignon
Spain,"Ripe aromas of fig, blackberry and cassis are ...",110.0,Northern Spain,Toro,,Tinta de Toro
US,Mac Watson honors the memory of a wine once ma...,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc
US,"This spent 20 months in 30% new French oak, an...",65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir
France,"This is the top wine from La Bégude, named aft...",66.0,Provence,Bandol,,Provence red blend


In [36]:
# we can do this in one step, when reading the CSV file in from disk

df = pd.read_csv(filename,
                usecols=['country', 'description', 'price', 'province', 'region_1', 'region_2', 'variety'],
                index_col='country')
df.head()

Unnamed: 0_level_0,description,price,province,region_1,region_2,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
US,This tremendous 100% varietal wine hails from ...,235.0,California,Napa Valley,Napa,Cabernet Sauvignon
Spain,"Ripe aromas of fig, blackberry and cassis are ...",110.0,Northern Spain,Toro,,Tinta de Toro
US,Mac Watson honors the memory of a wine once ma...,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc
US,"This spent 20 months in 30% new French oak, an...",65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir
France,"This is the top wine from La Bégude, named aft...",66.0,Provence,Bandol,,Provence red blend


In [39]:
df.loc['Albania']

Unnamed: 0_level_0,description,price,province,region_1,region_2,variety
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Albania,This garnet-colored wine made from 100% Kallme...,20.0,Mirditë,,,Kallmet
Albania,This garnet-colored wine made from 100% Kallme...,20.0,Mirditë,,,Kallmet
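An index does not have to be unique, which is why `df.loc['Albania']` returned two rows. A small sketch of the same behavior with made-up data, so it runs without the course file:

```python
import io
import pandas as pd

# Hypothetical miniature file with a repeated country
raw = (
    "country,price,variety\n"
    "US,235.0,Cabernet Sauvignon\n"
    "Albania,20.0,Kallmet\n"
    "Albania,20.0,Kallmet\n"
)

# index_col makes 'country' the index while reading
df = pd.read_csv(io.StringIO(raw), index_col='country')

# .loc returns every row whose index label matches
albania = df.loc['Albania']

print(albania)
```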


In [40]:
# I sync to GitHub with "gitautopush"
# Just search for it on PyPI, and install it

# Exercise: Wine reviews

1. Read the `winemag-150k-reviews.csv` data set into a data frame. We only care about country, price, and variety.
2. Which wine in this data set has the highest price?
3. Which country has the most wines in this data set?
4. What is the average price of Cabernet Sauvignon? How about Malbec?

In [41]:
filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename, 
                usecols=['country', 'price', 'variety'])
df.head()

Unnamed: 0,country,price,variety
0,US,235.0,Cabernet Sauvignon
1,Spain,110.0,Tinta de Toro
2,US,90.0,Sauvignon Blanc
3,US,65.0,Pinot Noir
4,France,66.0,Provence red blend


In [42]:
df.dtypes

country     object
price      float64
variety     object
dtype: object

In [45]:
# which wine has the highest price?
df.loc[df['price'] == df['price'].max()]

Unnamed: 0,country,price,variety
34920,France,2300.0,Bordeaux-style Red Blend


In [48]:
# which country has the most wines in this data set?
df['country'].value_counts().head(1)

US    62397
Name: country, dtype: int64

In [51]:
# What is the average price of Cabernet Sauvignon? 

# row selector: variety has to be CS
# column selector: price

df.loc[
    df['variety'] == 'Cabernet Sauvignon'   # row selector
    ,
    'price'  # column selector
].mean()

42.146634046247335

In [52]:
# How about Malbec?


df.loc[
    df['variety'] == 'Malbec'   # row selector
    ,
    'price'  # column selector
].mean()

25.631118314424636
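The two queries above differ only in the variety name, so they could be wrapped in a small function. The name `mean_price_for` and the sample data are made up for this sketch:

```python
import pandas as pd

def mean_price_for(df, variety):
    """Return the mean price of the rows whose 'variety' matches."""
    return df.loc[
        df['variety'] == variety,   # row selector
        'price'                     # column selector
    ].mean()

# Hypothetical sample data, standing in for the wine reviews
sample = pd.DataFrame({
    'variety': ['Cabernet Sauvignon', 'Malbec',
                'Cabernet Sauvignon', 'Malbec'],
    'price': [40.0, 20.0, 50.0, 30.0],
})

cab_mean = mean_price_for(sample, 'Cabernet Sauvignon')
malbec_mean = mean_price_for(sample, 'Malbec')

print(cab_mean, malbec_mean)
```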

Data files are here: https://files.lerner.co.il/pandas-workout-data.zip

# Let's look again at `pd.read_csv`

In [53]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols=None, squeeze: 'bool | None' = None, prefix: 'str | lib.NoDefault' = <no_default>, mangle_dupe_cols: 'bool' = True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace: 'bool' = False, skiprows=None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, skip_blank_lines: 'bool' = True, parse_dates=None, infer_datetime_format: 'bool' = False, keep_date_col: 'bool' = F

In [54]:
url = 'https://gist.githubusercontent.com/reuven/bb116ba2034bb10bb7e4e2caa5d8a000/raw/3660c4af808684dbf17af48b3d2f25b6a218535f/CSCO.csv'

In [55]:
pd.read_csv(url)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2017-12-22,38.52,38.740002,38.470001,38.549999,38.264591,11441600
1,2017-12-26,38.549999,38.68,38.360001,38.48,38.19511,8186100
2,2017-12-27,38.540001,38.650002,38.450001,38.560001,38.274517,10543000
3,2017-12-28,38.73,38.73,38.450001,38.59,38.304295,8807700
4,2017-12-29,38.41,38.619999,38.299999,38.299999,38.016441,12583600
5,2018-01-02,38.669998,38.950001,38.43,38.860001,38.572296,20135700
6,2018-01-03,38.720001,39.279999,38.529999,39.169998,38.879997,29536000
7,2018-01-04,39.049999,39.540001,38.93,38.990002,38.990002,20731400
8,2018-01-05,39.549999,39.880001,39.369999,39.529999,39.529999,24588200
9,2018-01-08,39.52,39.959999,39.349998,39.939999,39.939999,16582000


In [56]:
# Latest Bitcoin prices are at https://api.blockchain.info/charts/market-price?format=csv

In [57]:
url = 'https://api.blockchain.info/charts/market-price?format=csv'

df = pd.read_csv(url)
df.head()

Unnamed: 0,2021-11-29 00:00:00,57292.28
0,2021-11-30 00:00:00,57828.45
1,2021-12-01 00:00:00,57025.79
2,2021-12-02 00:00:00,57229.76
3,2021-12-03 00:00:00,56508.48
4,2021-12-04 00:00:00,53713.84


In [58]:
df.tail()

Unnamed: 0,2021-11-29 00:00:00,57292.28
360,2022-11-25 00:00:00,16592.67
361,2022-11-26 00:00:00,16507.44
362,2022-11-27 00:00:00,16453.47
363,2022-11-28 00:00:00,16420.2
364,2022-11-29 00:00:00,16208.96
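Look at the column names in the output above: this file has no header line, so pandas used the first row of data as the column names. A sketch of one way to handle that, using the first few rows shown above as in-memory text instead of the live URL. The column names `date` and `price` are my own choice, not something the file provides.

```python
import io
import pandas as pd

# First rows copied from the output above, standing in for the URL
raw = (
    "2021-11-29 00:00:00,57292.28\n"
    "2021-11-30 00:00:00,57828.45\n"
    "2021-12-01 00:00:00,57025.79\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    header=None,               # the text has no header line
    names=['date', 'price'],   # so we supply our own column names
    parse_dates=['date'],      # and turn the first column into datetimes
)

print(df)
print(df.dtypes)
```

Now the first row stays in the data, and the `date` column holds datetime values rather than strings.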
