# Agenda

1. Pandas and strings (and their dtypes)
2. String methods
3. Memory and strings
4. Categories

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [4]:
filename = '../data/winemag-150k-reviews.csv'
df = pd.read_csv(filename)

In [5]:
df.shape

(150930, 11)

In [6]:
df.columns

Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
       'price', 'province', 'region_1', 'region_2', 'variety', 'winery'],
      dtype='object')

In [7]:
df.head()

,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [8]:
df.dtypes  # what are the dtypes for this data frame?

Unnamed: 0       int64
country         object
description     object
designation     object
points           int64
price          float64
province        object
region_1        object
region_2        object
variety         object
winery          object
dtype: object

When we read data from a CSV file, Pandas has to guess each column's dtype. It'll usually guess one of three things:

- `int64` if it sees only integers
- `float64` if it sees digits and a decimal point, or numbers with missing values
- `object` for everything else, which mostly means strings

But `object` is special. It means that each value is an ordinary Python object (here, a `str`). The column itself stores only references to those objects, which live outside the compact, high-speed memory that Pandas (via NumPy) uses for numbers.
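
To see the guessing in action, here is a minimal sketch that reads a tiny, made-up CSV from a string instead of a file. The column names (`qty`, `price`, `name`, `rating`) and the values are invented for illustration; they aren't from the wine data. Depending on your Pandas version, the text column may be reported as a dedicated string dtype rather than `object`, but each value we pull out is still a Python `str`.

```python
import io
import pandas as pd

# A small, invented CSV standing in for a real file
csv_text = """qty,price,name,rating
1,9.5,apple,4
2,12.0,banana,
3,7.25,cherry,5
"""

sample = pd.read_csv(io.StringIO(csv_text))

# qty is all integers, price has decimal points,
# rating is integers plus one missing value, name is text
print(sample.dtypes)

# A value from the text column is an ordinary Python string
first_name = sample.loc[0, 'name']
print(type(first_name))
```

Notice that `rating` ends up as `float64` even though every value present is an integer: the missing value is stored as `NaN`, which is a float.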

In [9]:
df.head()

,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [11]:
type(df.loc[0, 'description'])

str

In [16]:
# passing a dict to the dtype parameter overrides what Pandas would guess for those columns
filename = '../data/winemag-150k-reviews.csv'
df = pd.read_csv(filename, dtype={'points':'int16', 'price':float})

In [17]:
df.dtypes

Unnamed: 0       int64
country         object
description     object
designation     object
points           int16
price          float64
province        object
region_1        object
region_2        object
variety         object
winery          object
dtype: object
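
Why bother choosing `int16` ourselves? A rough sketch, using a three-row made-up CSV rather than the real file: each `int64` value takes 8 bytes and each `int16` value takes 2, so a smaller dtype shrinks the column as long as the values fit (an `int16` holds -32768 through 32767, so scores in the 90s fit easily).

```python
import io
import pandas as pd

# Three invented rows with the same column names as the wine data
csv_text = "points,price\n96,235.0\n96,110.0\n95,66.0\n"

# Let Pandas guess the dtypes
default_df = pd.read_csv(io.StringIO(csv_text))

# Tell Pandas which dtypes we want for these two columns
small_df = pd.read_csv(io.StringIO(csv_text),
                       dtype={'points': 'int16', 'price': float})

# Size in bytes of the underlying values: 8 bytes vs. 2 bytes per value
default_bytes = default_df['points'].to_numpy().nbytes
small_bytes = small_df['points'].to_numpy().nbytes
print(default_bytes, small_bytes)
```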

How can we work with our strings?

We want to avoid using a `for` loop at almost any cost.

We always want to take advantage of vectorization in Pandas.

The way we do that with strings is with the `.str` accessor. Attached to a series, it gives us access to a large number of methods: most of those that Python's `str` type provides, plus a number that Pandas adds.
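
As a small sketch of the difference, here is the same uppercasing done with a loop (a list comprehension) and with `.str`, on a short made-up series. Both give the same values, but the `.str` version is shorter, keeps the index, and passes missing values through instead of raising an error.

```python
import pandas as pd

s = pd.Series(['Napa Valley', 'Toro', 'Bandol'])

# The loop-based approach we want to avoid
looped = pd.Series([one_string.upper() for one_string in s])

# The same thing via the .str accessor
vectorized = s.str.upper()

print(vectorized.equals(looped))

# With a missing value, .str.upper() leaves it missing;
# calling None.upper() in a loop would raise AttributeError
with_missing = pd.Series(['Toro', None])
upper_missing = with_missing.str.upper()
print(upper_missing)
```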

In [18]:
# if we want to lowercase something, we can use .str.lower -- just like in Python

df['description'].str.lower()

0         this tremendous 100% varietal wine hails from ...
1         ripe aromas of fig, blackberry and cassis are ...
2         mac watson honors the memory of a wine once ma...
3         this spent 20 months in 30% new french oak, an...
4         this is the top wine from la bégude, named aft...
                                ...                        
150925    many people feel fiano represents southern ita...
150926    offers an intriguing nose with ginger, lime an...
150927    this classic example comes from a cru vineyard...
150928    a perfect salmon shade, with scents of peaches...
150929    more pinot grigios should taste like this. a r...
Name: description, Length: 150930, dtype: object

In [19]:
df['description'].str.upper()

0         THIS TREMENDOUS 100% VARIETAL WINE HAILS FROM ...
1         RIPE AROMAS OF FIG, BLACKBERRY AND CASSIS ARE ...
2         MAC WATSON HONORS THE MEMORY OF A WINE ONCE MA...
3         THIS SPENT 20 MONTHS IN 30% NEW FRENCH OAK, AN...
4         THIS IS THE TOP WINE FROM LA BÉGUDE, NAMED AFT...
                                ...                        
150925    MANY PEOPLE FEEL FIANO REPRESENTS SOUTHERN ITA...
150926    OFFERS AN INTRIGUING NOSE WITH GINGER, LIME AN...
150927    THIS CLASSIC EXAMPLE COMES FROM A CRU VINEYARD...
150928    A PERFECT SALMON SHADE, WITH SCENTS OF PEACHES...
150929    MORE PINOT GRIGIOS SHOULD TASTE LIKE THIS. A R...
Name: description, Length: 150930, dtype: object

In [20]:
# You can assign the result of a method call to a new column, or to (replace) the existing column

df['description_upper'] = df['description'].str.upper()

In [21]:
df.head()

,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,description_upper
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,THIS TREMENDOUS 100% VARIETAL WINE HAILS FROM ...
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,"RIPE AROMAS OF FIG, BLACKBERRY AND CASSIS ARE ..."
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,MAC WATSON HONORS THE MEMORY OF A WINE ONCE MA...
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,"THIS SPENT 20 MONTHS IN 30% NEW FRENCH OAK, AN..."
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,"THIS IS THE TOP WINE FROM LA BÉGUDE, NAMED AFT..."


In [22]:
s = Series('10 20 hello 30 40'.split())
s

0       10
1       20
2    hello
3       30
4       40
dtype: object

In [25]:
# I want to keep only those values that can be turned into ints
# and then turn them into ints

# Python provides the str.isdigit method, which returns True if the string only contains digits

s.loc[s.str.isdigit()]  # we apply the boolean series to our column

0    10
1    20
3    30
4    40
dtype: object

In [26]:
s.loc[s.str.isdigit()].astype(int)   # now we get a new series containing ints without 'hello'

0    10
1    20
3    30
4    40
dtype: int64
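
One caveat: `str.isdigit` accepts only digits, so strings such as `'-5'` and `'3.5'` are filtered out along with `'hello'`. If we want to keep those too, one alternative (not what we used above) is `pd.to_numeric` with `errors='coerce'`, which turns anything it can't parse into `NaN`. A minimal sketch on an extended, made-up version of our series:

```python
import pandas as pd

s = pd.Series('10 20 hello 30 -5 3.5'.split())

# The isdigit approach keeps only non-negative whole numbers
digits_only = s.loc[s.str.isdigit()].astype(int)
print(digits_only)

# to_numeric converts what it can; 'hello' becomes NaN
as_numbers = pd.to_numeric(s, errors='coerce')
print(as_numbers)

# Drop the values that couldn't be converted
clean_numbers = as_numbers.dropna()
print(clean_numbers)
```

Because `NaN` is a float, the result here is a `float64` series rather than an integer one.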

In [28]:
# str.contains is not a Python method, but one that Pandas added; by default it treats its argument as a regular expression

s.loc[s.str.contains('e')]  # keep only those elements that contain the letter 'e'

2    hello
dtype: object
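
A few keyword arguments to `str.contains` are worth knowing about. The sketch below, on a made-up series, shows `regex=False` (search for the literal text), `case=False` (ignore case), and `na=False` (treat missing values as non-matches, so the result can be used as a boolean index).

```python
import pandas as pd

s = pd.Series(['a.b', 'abc', 'ABC', None])

# By default the pattern is a regular expression, and '.' matches any character
regex_match = s.str.contains('.', na=False)
print(regex_match.tolist())

# With regex=False, we look for a literal '.'
literal_match = s.str.contains('.', regex=False, na=False)
print(literal_match.tolist())

# case=False ignores the difference between 'b' and 'B'
ignore_case = s.str.contains('b', case=False, na=False)
print(ignore_case.tolist())
```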

In [29]:
# another useful method is str.len, which returns the length (as an int) of each string in the series

df['description'].str.len()

0         355
1         318
2         280
3         386
4         376
         ... 
150925    285
150926    266
150927    397
150928    253
150929    203
Name: description, Length: 150930, dtype: int64
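
Since `str.len` gives us back a series of ints, we can use it like any other numeric series. Here's a small sketch, on three invented reviews, that keeps only the longer strings and then finds the longest one.

```python
import pandas as pd

reviews = pd.Series(['Rich and toasty.', 'Dry.', 'A perfect salmon shade.'])

# One length per string
lengths = reviews.str.len()
print(lengths)

# Use the lengths in a comparison to build a boolean series
long_reviews = reviews.loc[lengths > 10]
print(long_reviews)

# idxmax returns the index of the largest length
longest = reviews.loc[lengths.idxmax()]
print(longest)
```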

# Exercises: Pandas strings

1. Load the `../data/winemag-150k-reviews.csv` file into a data frame.
2. How many reviews contain the letter 'x'?
3. What is the median review length? What is the mean review length?



In [30]:
df = pd.read_csv(filename)

In [33]:
# which rows of the data frame have a description containing x

df.loc[df['description'].str.contains('x')]

,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
5,5,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia
6,6,Spain,Slightly gritty black-fruit aromas include a s...,San Román,95,65.0,Northern Spain,Toro,,Tinta de Toro,Maurodos
...,...,...,...,...,...,...,...,...,...,...,...
150921,150921,France,Shows some older notes: a bouquet of toasted w...,Blanc de Blancs Brut Mosaïque,91,38.0,Champagne,Champagne,,Champagne Blend,Jacquart
150922,150922,Italy,Made by 30-ish Roberta Borghese high above Man...,Superiore,91,,Northeastern Italy,Colli Orientali del Friuli,,Tocai,Ronchi di Manzano
150923,150923,France,"Rich and toasty, with tiny bubbles. The bouque...",Demi-Sec,91,30.0,Champagne,Champagne,,Champagne Blend,Jacquart
150924,150924,France,"Really fine for a low-acid vintage, there's an...",Diamant Bleu,91,70.0,Champagne,Champagne,,Champagne Blend,Heidsieck & Co Monopole


In [34]:
# which descriptions contain 'x'

df.loc[
    df['description'].str.contains('x'),    # row selector
    'description'                           # column selector
]

2         Mac Watson honors the memory of a wine once ma...
3         This spent 20 months in 30% new French oak, an...
4         This is the top wine from La Bégude, named aft...
5         Deep, dense and pure from the opening bell, th...
6         Slightly gritty black-fruit aromas include a s...
                                ...                        
150921    Shows some older notes: a bouquet of toasted w...
150922    Made by 30-ish Roberta Borghese high above Man...
150923    Rich and toasty, with tiny bubbles. The bouque...
150924    Really fine for a low-acid vintage, there's an...
150927    This classic example comes from a cru vineyard...
Name: description, Length: 52795, dtype: object
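
The question was "how many", and the answer is the `Length: 52795` at the bottom of that output. We can also get the number directly: a boolean series can be summed, because `True` counts as 1 and `False` as 0. A minimal sketch on five invented reviews (note that the search is case-sensitive, so a capital 'X' doesn't count unless we ask for it):

```python
import pandas as pd

reviews = pd.Series(['Complex and rich',
                     'Dry and crisp',
                     'Exotic fruit mix',
                     'Soft tannins',
                     'Xinomavro blend'])

# True wherever the review contains a lowercase 'x'
has_x = reviews.str.contains('x')

# Summing the booleans counts the True values
how_many = has_x.sum()
print(how_many)

# The mean of the booleans is the proportion of True values
proportion = has_x.mean()
print(proportion)

# Ignoring case also picks up the capital 'X'
how_many_any_case = reviews.str.contains('x', case=False).sum()
print(how_many_any_case)
```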

In [36]:
# What is the median review length? What is the mean review length?

df['description'].str.len().describe()

count    150930.000000
mean        240.373948
std          69.196308
min          17.000000
25%         193.000000
50%         236.000000
75%         282.000000
max         829.000000
Name: description, dtype: float64
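
In the `describe` output, the mean is on the `mean` line and the median is the `50%` line. If we want just those two numbers, we can ask for them directly. A small sketch on three invented reviews:

```python
import pandas as pd

reviews = pd.Series(['Dry.', 'Rich and toasty.', 'A perfect salmon shade.'])

# Lengths are 4, 16, and 23
lengths = reviews.str.len()

mean_length = lengths.mean()
median_length = lengths.median()
print(mean_length, median_length)

# agg lets us ask for both at once, returning a two-element series
summary = lengths.agg(['mean', 'median'])
print(summary)
```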

In [37]:
# mean/median review lengths where the review contains 'x'

df.loc[
    df['description'].str.contains('x'),    # row selector
    'description'                           # column selector
].str.len().describe()

count    52795.000000
mean       263.258945
std         70.237918
min         50.000000
25%        214.000000
50%        257.000000
75%        305.000000
max        764.000000
Name: description, dtype: float64