## Hands-on 4
Let's continue with the wine magazine dataset from lecture to practice data transformation, grouping, and sorting.

<img src="https://secure.static.meredith.com/crt/store/covers/magazines/nmo/9826_l.jpg">

In [15]:
import pandas as pd

# Location of the wine review CSV used in lecture.
csvurl="https://gist.githubusercontent.com/clairehq/" + \
        "79acab35be50eaf1c383948ed3fd1129/raw/407a02139ae1e134992b90b4b2b8c329b3d73a6a/winemag-data-130k-v2.csv"

# Loads the data, drops the first column of the CSV, and previews the first rows.
wine = pd.read_csv(csvurl)
wine.drop(wine.columns[0], axis="columns", inplace=True)
wine.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


#### Question 1: ####  
What is the mean of the points column?

In [16]:
# Calculates the mean of the points column.
wine_points_mean = wine.points.mean()
wine_points_mean

88.43403716087269

#### Question 2: ####  
How many countries are present in this dataset? (Only count each country once)

In [17]:
# Finds the number of unique countries in the dataset.
wine["country"].nunique()

41
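
One detail worth keeping in mind: `nunique()` skips missing values by default, so rows with no country do not add to the count. A minimal sketch on made-up data (the values below are illustrative, not taken from the wine dataset):

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), with one missing country.
countries = pd.Series(["US", "Italy", "US", np.nan, "France"])

# NaN is skipped by default.
n_without_missing = countries.nunique()

# With dropna=False, NaN counts as its own value.
n_with_missing = countries.nunique(dropna=False)

print(n_without_missing, n_with_missing)
```

If missing countries should be reported separately, `countries.isna().sum()` gives their number.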

#### Question 3: ####
How many times does each country appear in this dataset? Show each country and its count, sorted by count in ascending order.

In [18]:
# Shows how many times each country appears in the dataset with counts in ascending order.
country_counts = wine["country"].value_counts(ascending=True)
country_counts.name = "Country Count"
country_counts

Bosnia and Herzegovina        1
Armenia                       1
Slovakia                      1
Switzerland                   4
Luxembourg                    4
India                         4
Ukraine                       5
Macedonia                     6
Czech Republic                6
Cyprus                        6
Serbia                        7
Peru                          8
Morocco                      11
Lebanon                      20
Moldova                      30
Brazil                       31
Mexico                       31
England                      36
Georgia                      37
Slovenia                     39
Turkey                       43
Croatia                      44
Uruguay                      61
Hungary                      61
Romania                      67
Bulgaria                     68
Canada                      108
Greece                      242
Israel                      259
New Zealand                 733
South Africa                737
Germany 

#### Question 4: ####
Create a variable `adjusted_price` containing the adjusted price, which is the price minus the average price. *This is called a **"centering" transformation**, a method commonly used as a preprocessing step before applying various machine learning algorithms.*

In [19]:
# Finds the mean price.
wine_price_mean = wine.price.mean()

# Finds the adjusted price: each price minus the average price.
adjusted_price = wine.price - wine_price_mean
adjusted_price.name = "Adjusted Price"
adjusted_price

0              NaN
1       -20.232932
2       -21.232932
3       -22.232932
4        29.767068
           ...    
65494     9.767068
65495   -13.232932
65496   -15.232932
65497    -4.232932
65498   -25.232932
Name: Adjusted Price, Length: 65499, dtype: float64
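
To see what centering does, here is a small sketch on made-up prices (not the real data): subtracting the mean shifts the values so that their average is zero, and a missing price stays missing, as in row 0 of the output above.

```python
import numpy as np
import pandas as pd

# Toy prices (made up for illustration), including one missing value.
prices = pd.Series([15.0, 14.0, np.nan, 65.0, 26.0], name="price")

# mean() skips NaN by default, so the average uses the four known prices.
centered = prices - prices.mean()

print(centered)
print(centered.mean())  # approximately zero
```

The spread of the values is unchanged; only their location moves.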

#### Question 5: ####
What is the title of the wine that has the highest points-to-price ratio in the dataset?

In [20]:
# Creates a new column in the dataframe containing points-to-price ratios.
wine["points-to-price ratio"] = wine["points"] / wine["price"]

# Finds the title of the wine that has the highest points-to-price ratio in the dataset.
wine.loc[wine["points-to-price ratio"].idxmax(), "title"]

'Bandit NV Merlot (California)'
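
`idxmax` returns only the first row label at the maximum, so if several wines tied for the best ratio, the cell above would show just one of them. A minimal sketch on made-up data (the titles and numbers are hypothetical) of how to list every tied row:

```python
import pandas as pd

# Toy data (made up for illustration); two wines share the best ratio.
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [80, 90, 80],
    "price": [4.0, 30.0, 4.0],
})
ratio = toy["points"] / toy["price"]

# First row at the maximum.
first_best = toy.loc[ratio.idxmax(), "title"]

# Every row at the maximum.
all_best = toy.loc[ratio == ratio.max(), "title"].tolist()

print(first_best)
print(all_best)
```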

#### Question 6: ####
Create a Series `flavor_counts` that contains two values: the number of wines that have the word "tart" in the `description` column and the number of wines that have the word "berries" in the `description` column. The index of the Series should be "Tart" and "Berries" for the corresponding values.

In [21]:
# Counts the wines whose description contains the word "tart" and the wines whose description contains the word "berries".
# Matching is case-insensitive and on whole words only (\b marks a word boundary), so "tartness" or "blackberries" do not count.
tart_count = wine["description"].str.contains(pat=r"\btart\b", case=False, regex=True).sum()
berries_count = wine["description"].str.contains(pat=r"\bberries\b", case=False, regex=True).sum()

# Puts the counts in a list and creates a list for the index.
counts = [tart_count, berries_count]
key_words = ["Tart", "Berries"]

# Creates a series containing the counts and sets the index as "Tart" and "Berries" for the corresponding values.
flavor_counts = pd.Series(counts, index=key_words, name="Flavor Count")
flavor_counts

Tart       3086
Berries    1192
Name: Flavor Count, dtype: int64
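
The word-boundary pattern matters here. A minimal sketch on made-up descriptions (not rows from the dataset) that compares a plain substring search with the whole-word search used above:

```python
import numpy as np
import pandas as pd

# Toy descriptions (made up for illustration), including a missing one.
desc = pd.Series([
    "Tart and snappy, with lime flesh.",
    "A tartness that lingers.",
    "Ripe blackberries and plum.",
    "Red berries on the finish.",
    np.nan,
])

# na=False treats a missing description as "no match".
substring_hits = desc.str.contains("tart", case=False, na=False).sum()
whole_word_hits = desc.str.contains(r"\btart\b", case=False, na=False).sum()
berries_word_hits = desc.str.contains(r"\bberries\b", case=False, na=False).sum()

print(substring_hits, whole_word_hits, berries_word_hits)
```

The substring search also picks up "tartness", while `\b...\b` keeps only the standalone word.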

#### Question 7: ####
Let's convert the points into simple star ratings. A score of 90 or higher counts as 3 stars, a score of at least 80 but less than 90 is 2 stars. Any other score is 1 star.

Also, any wines from France should automatically get 3 stars, regardless of points.

Add this new column `star_ratings` to the dataframe with the number of stars for each wine in the dataset. 

In [22]:
# Defines a function to convert points into simple star ratings based on the criteria above.
def get_stars(row):
  if row.points >= 90 or row.country == "France":
    return 3
  elif 80 <= row.points < 90:
    return 2
  else:
    return 1

# Applies the function above to each row and stores the star ratings for each row in a column star_ratings.
# Adds star_ratings column to the dataframe.
wine["star_ratings"] = wine.apply(get_stars, axis="columns")
wine

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,points-to-price ratio,star_ratings
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,,2
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,5.800000,2
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,6.214286,2
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,6.692308,2
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,1.338462,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65494,France,Made from young vines from the Vaulorent porti...,Fourchaume Premier Cru,90,45.0,Burgundy,Chablis,,Roger Voss,@vossroger,William Fèvre 2005 Fourchaume Premier Cru (Ch...,Chardonnay,William Fèvre,2.000000,3
65495,Australia,"This is a big, fat, almost sweet-tasting Caber...",,90,22.0,South Australia,McLaren Vale,,Joe Czerwinski,@JoeCz,Tapestry 2005 Cabernet Sauvignon (McLaren Vale),Cabernet Sauvignon,Tapestry,4.090909,3
65496,US,"Much improved over the unripe 2005, Fritz's 20...",Estate,90,20.0,California,Dry Creek Valley,Sonoma,,,Fritz 2006 Estate Sauvignon Blanc (Dry Creek V...,Sauvignon Blanc,Fritz,4.500000,3
65497,US,This wine wears its 15.8% alcohol better than ...,Block 24,90,31.0,California,Napa Valley,Napa,,,Hendry 2004 Block 24 Primitivo (Napa Valley),Primitivo,Hendry,2.903226,3
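
Row-wise `apply` is easy to read but can be slow on large frames. As an alternative sketch (not the approach used above), the same rules can be written with `numpy.select`, which evaluates whole columns at once. The data below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy reviews (made up for illustration).
toy = pd.DataFrame({
    "country": ["US", "France", "Italy", "France", "US"],
    "points": [92, 78, 85, 95, 79],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    (toy["points"] >= 90) | (toy["country"] == "France"),  # 3 stars
    toy["points"] >= 80,                                    # 2 stars
]
toy["star_ratings"] = np.select(conditions, [3, 2], default=1)

print(toy)
```

Because the conditions are ordered, the second one does not need an explicit "less than 90" check.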


#### Question 8: ####
Who are the most common wine reviewers in the dataset? Create a Series whose index is the `taster_twitter_handle` category from the dataset, and whose values count how many reviews each person wrote.

In [23]:
# Finds the most common wine reviewers in the dataset.
wine_reviewers = wine.groupby("taster_twitter_handle").taster_name.count().sort_values(ascending=False)
wine_reviewers.name = "Most Common Wine Reviewers"
wine_reviewers

taster_twitter_handle
@vossroger          13045
@wineschach          7752
@kerinokeefe         5313
@paulgwine           4851
@vboone              4696
@mattkettmann        3035
@JoeCz               2605
@wawinereport        2358
@gordone_cellars     2032
@AnneInVino          1769
@laurbuzz             938
@suskostrzewa         593
@worldwineguys        465
@bkfiona               11
@winewchristina         4
Name: Most Common Wine Reviewers, dtype: int64
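
Note that `count()` on a grouped column counts non-missing values of that column, while `size()` counts rows. The two agree only when the counted column has no gaps within a group. A minimal sketch on made-up data (the handles and names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy reviews (made up for illustration); one review has a handle but no
# taster name, and one has a name but no handle.
toy = pd.DataFrame({
    "taster_twitter_handle": ["@a", "@a", "@b", "@b", "@b", np.nan],
    "taster_name": ["Ann", "Ann", "Bob", np.nan, "Bob", "Cy"],
})

by_handle = toy.groupby("taster_twitter_handle")

# Skips rows where taster_name is missing.
non_null_names = by_handle["taster_name"].count()

# Counts every row in each group.
rows_per_handle = by_handle.size()

# Same row counts, already sorted from most to least frequent.
via_value_counts = toy["taster_twitter_handle"].value_counts()

print(non_null_names, rows_per_handle, via_value_counts, sep="\n\n")
```

In all three cases the review with a missing handle is left out, since missing group keys are dropped by default.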

#### Question 9: ####
Which combinations of countries and varieties are most common? Create a Series whose index is a MultiIndex of {country, variety} pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}. Sort the values in the Series in descending order based on wine count.

In [24]:
# Finds the combination of countries and varieties that are most common.
combinations = wine.groupby(["country", "variety"]).size().sort_values(ascending=False)
combinations.name = "Most Common Combinations of Countries and Varieties"
combinations

country  variety                 
US       Pinot Noir                  4918
         Cabernet Sauvignon          3649
         Chardonnay                  3412
France   Bordeaux-style Red Blend    2380
Italy    Red Blend                   1870
                                     ... 
         Torbato                        1
         Vespaiolo                      1
         Vespolina                      1
         Vitovska                       1
Uruguay  Tempranillo-Tannat             1
Name: Most Common Combinations of Countries and Varieties, Length: 1304, dtype: int64
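
Once the counts sit in a Series with a MultiIndex, individual pairs can be looked up with a tuple. A minimal sketch on made-up data (not the real counts) showing a few ways to read such a Series:

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France", "Italy", "US"],
    "variety": ["Pinot Noir", "Pinot Noir", "Chardonnay",
                "Chardonnay", "Red Blend", "Pinot Noir"],
})

pair_counts = toy.groupby(["country", "variety"]).size().sort_values(ascending=False)

# Look up one {country, variety} pair with a tuple.
us_pinot = pair_counts.loc[("US", "Pinot Noir")]

# Select all varieties for one country.
us_only = pair_counts.xs("US", level="country")

# Turn the MultiIndex back into ordinary columns.
flat = pair_counts.reset_index(name="count")

print(us_pinot)
print(us_only)
print(flat)
```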

#### Question 10: ####
Create a Series whose index is reviewers and whose values are the average scores given out by each reviewer. Hint: you will need the `taster_name` and `points` columns.

In [25]:
# Creates a Series whose index is reviewers and whose values are the average scores given out by each reviewer.
reviewer_average_scores = wine.groupby("taster_name").points.mean()
reviewer_average_scores.name = "Average Score by Reviewer"
reviewer_average_scores

taster_name
Alexander Peartree    86.014286
Anna Lee C. Iijima    88.380506
Anne Krebiehl MW      90.587903
Carrie Dykes          86.644444
Christina Pickard     89.500000
Fiona Adams           87.090909
Jeff Jenssen          88.273504
Jim Gordon            88.604331
Joe Czerwinski        88.519770
Kerin O’Keefe         88.827969
Lauren Buzzeo         87.831557
Matt Kettmann         90.021087
Michael Schachner     86.904541
Mike DeSimone         89.030303
Paul Gregutt          89.095032
Roger Voss            88.678957
Sean P. Sullivan      88.666243
Susan Kostrzewa       86.408094
Virginie Boone        89.229557
Name: Average Score by Reviewer, dtype: float64
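
An average says little without the number of reviews behind it (the handle counts in Question 8 go as low as 4). One possible extension, sketched on made-up data, is to compute the mean and the review count together with `agg`:

```python
import pandas as pd

# Toy reviews (made up for illustration).
toy = pd.DataFrame({
    "taster_name": ["Ann", "Ann", "Ann", "Bob", "Cy", "Cy"],
    "points": [88, 90, 92, 95, 84, 86],
})

# One row per reviewer, with the average score and the number of reviews.
summary = (
    toy.groupby("taster_name")["points"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

Here the top average belongs to a reviewer with a single review, which is easy to miss when only the means are shown.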