# Agenda

1. Pandas and strings (and their dtypes)
2. String methods (and how to call them)
3. Memory and strings
4. Categories

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
filename = '../data/winemag-150k-reviews.csv'
df = pd.read_csv(filename)

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [4]:
df.shape

(150930, 11)

In [5]:
df.dtypes

Unnamed: 0       int64
country         object
description     object
designation     object
points           int64
price          float64
province        object
region_1        object
region_2        object
variety         object
winery          object
dtype: object

When we read from a CSV file, Pandas normally has to guess what `dtype` to assign to each column. 

- If all of the values in a column are digits (and none are missing), then it assigns `int64`
- If the values are digits plus one decimal point, then it assigns `float64`
- In any other case, it assigns `object`

We'll treat all of these `object` values as Python strings, and Pandas refers to those Python objects with pointers. In theory, `object` could be any Python object. But in practice, `object` almost always means a string.
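To see the guessing in action, here is a minimal sketch that reads two tiny CSV snippets from memory. The column names and values are made up for illustration; they are not taken from the wine data. It also shows a case the rules of thumb gloss over: a column of digits with one missing value comes back as `float64`, because the missing value is stored as `NaN`.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for this illustration
csv_text = "n_bottles,price,country\n1,9.5,US\n2,12.0,Spain\n"
small_df = pd.read_csv(io.StringIO(csv_text))

# n_bottles (digits only) -> int64, price (digits plus decimal point) -> float64,
# country -> a string column (shown as object in the pandas version used in this notebook)
print(small_df.dtypes)

# Same idea, but the second row has no value for n_bottles
csv_with_gap = "n_bottles,country\n1,US\n,Spain\n3,Italy\n"
gap_df = pd.read_csv(io.StringIO(csv_with_gap))

# The missing value becomes NaN, so the whole column is float64 rather than int64
print(gap_df.dtypes)
```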

In [6]:
df.loc[0, 'description']  # row index 0, column 'description'

'This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.'

In [8]:
type(df.loc[0, 'description'])  # row index 0, column 'description'

str

How can we work with these strings?

We could use a `for` loop to go through each string and invoke one or more methods on it. But we want to use the vectorization capabilities of Pandas, which make things much faster than that.

Pandas gives us `.str`, an attribute on string columns through which we can invoke a variety of methods. This gives us access to most of Python's string methods, *plus* a bunch of others that Pandas has borrowed from other languages, *plus* some extensions that Pandas found useful.
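As a rough sketch of the difference, the following compares a plain Python loop with the `.str` accessor on a small, made-up series (the values are hypothetical, including one deliberately missing entry).

```python
import pandas as pd

# Hypothetical country names, including one missing value
countries = pd.Series(['US', 'Spain', None, 'France'])

# Plain Python: visit each element ourselves and guard against the missing value
loop_lengths = [len(name) if isinstance(name, str) else None for name in countries]

# Vectorized: one call through the .str accessor; the missing value stays missing
str_lengths = countries.str.len()

print(loop_lengths)
print(str_lengths)
```

Note that `.str` skips over the missing value on its own, whereas the loop needs an explicit check. In the pandas version used in this notebook, a missing value in the result is stored as `NaN`, which is why lengths can come back as floats rather than integers.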

In [10]:
# what is the length of each country's name?

len(df['country'])  # this will give us the column's length, not the length of each element

150930

In [11]:
# the result is a series with the same index, but with values describing the length

df['country'].str.len()  

0         2.0
1         5.0
2         2.0
3         2.0
4         6.0
         ... 
150925    5.0
150926    6.0
150927    5.0
150928    6.0
150929    5.0
Name: country, Length: 150930, dtype: float64

In [13]:
# which countries have > 5 letters in their names? 
df.loc[df['country'].str.len() > 5]

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
13,13,France,This wine is in peak condition. The tannins an...,Château Montus Prestige,95,90.0,Southwest France,Madiran,,Tannat,Vignobles Brumont
18,18,France,Coming from a seven-acre vineyard named after ...,Le Pigeonnier,95,290.0,Southwest France,Cahors,,Malbec,Château Lagrézette
25,25,New Zealand,"Yields were down in 2015, but intensity is up,...",Maté's Vineyard,94,57.0,Kumeu,,,Chardonnay,Kumeu River
30,30,Bulgaria,This Bulgarian Mavrud presents the nose with s...,Bergulé,90,15.0,Bulgaria,,,Mavrud,Villa Melnik
...,...,...,...,...,...,...,...,...,...,...,...
150921,150921,France,Shows some older notes: a bouquet of toasted w...,Blanc de Blancs Brut Mosaïque,91,38.0,Champagne,Champagne,,Champagne Blend,Jacquart
150923,150923,France,"Rich and toasty, with tiny bubbles. The bouque...",Demi-Sec,91,30.0,Champagne,Champagne,,Champagne Blend,Jacquart
150924,150924,France,"Really fine for a low-acid vintage, there's an...",Diamant Bleu,91,70.0,Champagne,Champagne,,Champagne Blend,Heidsieck & Co Monopole
150926,150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain


# Exercise: Descriptions + scores == correlated?

1. Calculate the number of characters in each description. Using the `quantile` method, find the length that marks the start of the longest 25% of descriptions. For those descriptions, get the mean `points`.
2. Now calculate the mean `points` for the shortest 25% of descriptions. Do we see any (ridiculous) correlation?

In [20]:
df.loc[  
    df['description'].str.len() >= df['description'].str.len().quantile(0.75)      # row selector
,
    'points'    # column selector
].mean()

90.02646323819545

In [21]:
df.loc[  
    df['description'].str.len() <= df['description'].str.len().quantile(0.25)      # row selector
,
    'points'    # column selector
].mean()

85.838828701523
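The two solution cells compute `df['description'].str.len()` several times. One possible variation, sketched here on a tiny invented data set rather than the wine reviews, is to store the lengths once and reuse them for both cutoffs. The names `reviews`, `lengths`, `long_cutoff`, and `short_cutoff` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical miniature data set: eight descriptions of 1 to 8 characters
reviews = pd.DataFrame({
    'description': ['a' * n for n in range(1, 9)],
    'points': [80, 82, 84, 86, 88, 90, 92, 94],
})

lengths = reviews['description'].str.len()   # computed once, reused below
long_cutoff = lengths.quantile(0.75)         # lengths at or above this are the longest 25%
short_cutoff = lengths.quantile(0.25)        # lengths at or below this are the shortest 25%

long_mean = reviews.loc[lengths >= long_cutoff, 'points'].mean()
short_mean = reviews.loc[lengths <= short_cutoff, 'points'].mean()

print(long_mean, short_mean)
```

A boolean series such as `lengths >= long_cutoff` works as a row selector in `.loc` because it shares its index with `reviews`.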

In [22]:
# what other methods do we have?

df['country'].str.lower()

0             us
1          spain
2             us
3             us
4         france
           ...  
150925     italy
150926    france
150927     italy
150928    france
150929     italy
Name: country, Length: 150930, dtype: object

In [23]:
df['country'].str.upper()

0             US
1          SPAIN
2             US
3             US
4         FRANCE
           ...  
150925     ITALY
150926    FRANCE
150927     ITALY
150928    FRANCE
150929     ITALY
Name: country, Length: 150930, dtype: object

In [27]:
# .str.contains lets us search inside of a string for contents
# it's sort of like the "in" operator in Python for strings

df.loc[
   df['description'].str.contains('x')
]

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
5,5,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia
6,6,Spain,Slightly gritty black-fruit aromas include a s...,San Román,95,65.0,Northern Spain,Toro,,Tinta de Toro,Maurodos
...,...,...,...,...,...,...,...,...,...,...,...
150921,150921,France,Shows some older notes: a bouquet of toasted w...,Blanc de Blancs Brut Mosaïque,91,38.0,Champagne,Champagne,,Champagne Blend,Jacquart
150922,150922,Italy,Made by 30-ish Roberta Borghese high above Man...,Superiore,91,,Northeastern Italy,Colli Orientali del Friuli,,Tocai,Ronchi di Manzano
150923,150923,France,"Rich and toasty, with tiny bubbles. The bouque...",Demi-Sec,91,30.0,Champagne,Champagne,,Champagne Blend,Jacquart
150924,150924,France,"Really fine for a low-acid vintage, there's an...",Diamant Bleu,91,70.0,Champagne,Champagne,,Champagne Blend,Heidsieck & Co Monopole


In [None]:
df.dropna(subset='country').loc[   
        df.dropna(subset='country')['country'].str.contains('b')  
]

In [35]:
# let's say we want to find rows where the country contains "b"

(
    df
    .dropna(subset='country')
    .loc[   
        df.dropna(subset='country')['country'].str.contains('b')  
    ]
)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
2523,2523,Lebanon,"Deep ruby in color, with aromas of black cherr...",Altitudes,91,20.0,Lebanon,,,Red Blend,Ixsir
2616,2616,Serbia,This white wine from Serbia has aromas of cara...,,87,26.0,Pocerina,,,Morava,Milijan Jelić
2664,2664,Serbia,"White flowers, eucalyptus and a slight whiff o...",Margus Margi,89,20.0,Župa,,,Riesling,Budimir
2666,2666,Serbia,A blend of 60% Prokupac and 40% Cabernet Sauvi...,Sub Rosa,89,40.0,Župa,,,Red Blend,Budimir
2841,2841,Serbia,A blend of 60% Prokupac and 40% Cabernet Sauvi...,Sub Rosa,89,40.0,Župa,,,Red Blend,Budimir
...,...,...,...,...,...,...,...,...,...,...,...
126992,126992,Lebanon,"Simple, solid aromas of cherry, cinnamon and s...",Hochar Père et Fils,83,25.0,Bekaa Valley,,,Red Blend,Château Musar
127344,127344,Lebanon,This white blend from Lebanon has a nose of al...,Gaston Hochar,81,38.0,Bekaa Valley,,,White Blend,Château Musar
134565,134565,Luxembourg,"Starts with aromas of minerals, pear and apple...",Domaine et Tradition Machtum Hohfels,88,50.0,Moselle Luxembourgeoise,,,Pinot Gris,Mme Aly Duhr et Fils
137278,137278,Luxembourg,"Offers aromas of honey, almond and white fruit...",Ahn Hohfels Grand Premier Cru,87,36.0,Moselle Luxembourgeoise,,,Pinot Gris,Mme Aly Duhr et Fils


In [36]:
# improve this with method chaining
# plus some use of lambda (anonymous function)

(
    df
    .dropna(subset='country')
    .loc[   
        lambda df_: df_['country'].str.contains('b')  # here, our use of lambda (anonymous function) and df_ means -- use what we got in the chain
    ]
)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
2523,2523,Lebanon,"Deep ruby in color, with aromas of black cherr...",Altitudes,91,20.0,Lebanon,,,Red Blend,Ixsir
2616,2616,Serbia,This white wine from Serbia has aromas of cara...,,87,26.0,Pocerina,,,Morava,Milijan Jelić
2664,2664,Serbia,"White flowers, eucalyptus and a slight whiff o...",Margus Margi,89,20.0,Župa,,,Riesling,Budimir
2666,2666,Serbia,A blend of 60% Prokupac and 40% Cabernet Sauvi...,Sub Rosa,89,40.0,Župa,,,Red Blend,Budimir
2841,2841,Serbia,A blend of 60% Prokupac and 40% Cabernet Sauvi...,Sub Rosa,89,40.0,Župa,,,Red Blend,Budimir
...,...,...,...,...,...,...,...,...,...,...,...
126992,126992,Lebanon,"Simple, solid aromas of cherry, cinnamon and s...",Hochar Père et Fils,83,25.0,Bekaa Valley,,,Red Blend,Château Musar
127344,127344,Lebanon,This white blend from Lebanon has a nose of al...,Gaston Hochar,81,38.0,Bekaa Valley,,,White Blend,Château Musar
134565,134565,Luxembourg,"Starts with aromas of minerals, pear and apple...",Domaine et Tradition Machtum Hohfels,88,50.0,Moselle Luxembourgeoise,,,Pinot Gris,Mme Aly Duhr et Fils
137278,137278,Luxembourg,"Offers aromas of honey, almond and white fruit...",Ahn Hohfels Grand Premier Cru,87,36.0,Moselle Luxembourgeoise,,,Pinot Gris,Mme Aly Duhr et Fils
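The `dropna` step is there because some `country` values are missing, and a missing value cannot be used as a True/False row selector. As a hedged alternative sketch, `str.contains` also accepts an `na` argument that says what to return for missing values. The small `wines` data frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical rows, including one with a missing country
wines = pd.DataFrame({
    'country': ['Lebanon', 'US', None, 'Serbia'],
    'points': [91, 96, 88, 87],
})

# Approach from the chained cell: drop missing countries, then filter with a lambda
chained = (
    wines
    .dropna(subset=['country'])
    .loc[lambda df_: df_['country'].str.contains('b')]
)

# Alternative: keep every row, but have contains() report False for missing values
with_na_false = wines.loc[wines['country'].str.contains('b', na=False)]

print(chained)
print(with_na_false)
```

Both selections return the same rows here. Which one reads better is a matter of taste: `dropna` makes the removal of missing data explicit, while `na=False` keeps the filter in a single step.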


In [38]:
# how many wines come from countries containing either 'j' or "J"?

df['country'].str.contains('j', case=False).value_counts()   # now it'll be case insensitive

country
False    150923
True          2
Name: count, dtype: int64
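To go from counting matches to retrieving them, the same case-insensitive test can serve as a row selector. This is a minimal sketch on a made-up series; passing `na=False` is an assumption added here so that missing values are simply treated as non-matches.

```python
import pandas as pd

# Hypothetical country names, including one missing value
countries = pd.Series(['Japan', 'Fiji', 'US', None])

# Count the matches: case=False makes 'j' match both 'j' and 'J'
counts = countries.str.contains('j', case=False).value_counts()
print(counts)

# Retrieve the matching elements themselves
matching = countries.loc[countries.str.contains('j', case=False, na=False)]
print(matching)
```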