# Summary Functions and Maps

This section covers operations for summarizing data and for transforming it into the form a given task needs.

In [3]:
import pandas as pd
pd.set_option('display.max_rows', 5)
import numpy as np
reviews = pd.read_csv("./data/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

## Summary Functions

`pandas` provides many simple 'summary functions' that condense a column into a more useful overview.

For example, the `describe()` method generates a high-level summary of the attributes of the given column:

In [4]:
reviews.points.describe()

count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

`describe()` is type-aware: for string data it reports the count, the number of unique values, the most frequent value, and how often that value occurs:

In [5]:
reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
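As a minimal sketch of this type-aware behaviour, the snippet below calls `describe()` on a numeric and a string column. The `toy` DataFrame and its values are hypothetical stand-ins for the wine data, chosen only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({
    "points": [87, 90, 90, 95],
    "taster_name": ["Roger Voss", "Roger Voss", "Paul Gregutt", None],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy.points.describe()

# String column: count (non-missing), unique, top, freq.
text_summary = toy.taster_name.describe()

print(numeric_summary["mean"])   # 90.5
print(text_summary["top"])       # Roger Voss
```

Note that `count` ignores missing values in both cases, which is why the count for `taster_name` in the real dataset is lower than the number of rows.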

If you want a particular simple summary statistic about a column, there is usually a helpful `pandas` function.

For example, the mean:

In [6]:
reviews.points.mean()

88.44713820775404

Or a list of the unique values:

In [7]:
reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

For a list of unique values and how often they occur:

In [8]:
reviews.taster_name.value_counts()

taster_name
Roger Voss           25514
Michael Schachner    15134
                     ...  
Fiona Adams             27
Christina Pickard        6
Name: count, Length: 19, dtype: int64
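The outputs above differ slightly in size: `unique()` lists `nan` as one of its entries, while `value_counts()` leaves missing values out by default. A small sketch of that difference, using a hypothetical `tasters` Series rather than the real data:

```python
import pandas as pd

# Hypothetical Series with one missing taster name.
tasters = pd.Series(["Roger Voss", "Paul Gregutt", "Roger Voss", None])

unique_values = tasters.unique()                       # missing value kept as one entry
counts = tasters.value_counts()                        # missing values dropped by default
counts_with_nan = tasters.value_counts(dropna=False)   # missing values counted too

print(len(unique_values), len(counts), len(counts_with_nan))   # 3 2 3
```

Passing `dropna=False` is useful when the number of missing entries is itself of interest.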

## Maps

In data science, maps are used for creating new representations from existing data or transforming data between formats.

Two mapping methods are commonly used: `map()` and `apply()`.

Starting with `map()`, we could re-center the scores the wines received around 0 by subtracting the mean from each one:

In [10]:
review_points_mean = reviews.points.mean()

reviews.points.map(lambda p: p - review_points_mean)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

The function passed to `map()` should expect a single value from the Series and return a transformed version of that value. `map()` returns a new Series of the transformed values.
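The same pattern on a tiny, hypothetical Series makes the behaviour easy to check by hand. This is only a sketch; the values are invented for illustration.

```python
import pandas as pd

points = pd.Series([87, 90, 93])   # hypothetical scores
mean_points = points.mean()        # 90.0

# The lambda receives one value at a time and returns its transformed version.
centered = points.map(lambda p: p - mean_points)

print(centered.tolist())   # [-3.0, 0.0, 3.0]
print(points.tolist())     # [87, 90, 93] (the original Series is unchanged)
```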

`apply()` is used for transforming a whole DataFrame by calling a custom method on each row. 

To re-center the points in the same way using `apply()` instead:

In [11]:
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,1.552862,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,1.552862,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


To transform each column of the DataFrame instead, we could call `reviews.apply()` with `axis='index'`.
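A minimal sketch of the column-wise case, assuming a small hypothetical DataFrame with numeric columns only. The helper name `center` is invented for this example.

```python
import pandas as pd

# Hypothetical numeric-only data.
scores = pd.DataFrame({"points": [87, 90, 93], "price": [10.0, 20.0, 60.0]})

def center(column):
    # With axis="index", `column` is a whole Series, one per DataFrame column.
    return column - column.mean()

centered = scores.apply(center, axis="index")

print(centered["points"].tolist())   # [-3.0, 0.0, 3.0]
print(centered["price"].tolist())    # [-20.0, -10.0, 30.0]
```

The function here receives an entire column at once, so column-level statistics such as `column.mean()` are available inside it.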

Note that `map()` and `apply()` return new Series/DataFrames and don't mutate the original data.

`pandas` has many common mapping operations as built-ins.

For example, we could remean the points column like this:

In [13]:
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

`pandas` detects the operation between many values on the left and a single value on the right and presumes we want to subtract the mean from each value in the column. 

All of the standard Python operators (`+`, `-`, `>`, `<`, `==`, and so on) work this way.

For example, using `+` between Series of equal length combines two columns of text element by element:

In [14]:
reviews.country + " - " + reviews.region_1

0            Italy - Etna
1                     NaN
               ...       
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object
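Row 1 in the output above is `NaN` because that wine has no `region_1`: a missing value on either side makes the combined result missing. A short sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

country = pd.Series(["Italy", "Portugal"])
region = pd.Series(["Etna", np.nan])   # second region is missing

combined = country + " - " + region

print(combined[0])            # Italy - Etna
print(pd.isna(combined[1]))   # True
```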

These operators are faster than `map()` or `apply()` but less flexible when it comes to advanced uses such as conditional logic.
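To illustrate the trade-off, the sketch below uses a comparison operator for a simple yes/no test and falls back to `map()` when each value needs branching logic. The data and the labels `"high"` and `"regular"` are hypothetical.

```python
import pandas as pd

points = pd.Series([82, 88, 96])   # hypothetical scores

# Operator form: one vectorized comparison, returns a boolean Series.
is_high = points >= 95

# map() form: arbitrary Python logic applied value by value.
label = points.map(lambda p: "high" if p >= 95 else "regular")

print(is_high.tolist())   # [False, False, True]
print(label.tolist())     # ['regular', 'regular', 'high']
```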

# Exercises

1. What is the median of the `points` column in the `reviews` DataFrame?

In [15]:
median_points = reviews.points.median()

median_points

88.0

2. What countries are represented in the dataset? The result should not contain duplicates.

In [16]:
countries = reviews.country.unique()

countries

array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)

3. How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [19]:
reviews_per_country = reviews.country.value_counts()

reviews_per_country

country
US        54504
France    22093
          ...  
China         1
Egypt         1
Name: count, Length: 43, dtype: int64

4. Create a variable `centered_price` containing a version of the `price` column with the mean price subtracted:

In [20]:
review_price_mean = reviews.price.mean()

centered_price = reviews.price - review_price_mean

centered_price

0               NaN
1        -20.363389
            ...    
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64

5. Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio:

In [25]:
points_to_price_ratio = reviews.points / reviews.price

# idxmax() returns an index label, so look the title up by label with .loc
bargain_wine = reviews.title.loc[points_to_price_ratio.idxmax()]

bargain_wine

'Bandit NV Merlot (California)'
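One detail worth knowing here: `idxmax()` returns an index label, not a row position. With the default 0, 1, 2, ... index of `reviews` the two coincide, but they differ as soon as the index is anything else. A sketch with a hypothetical `wines` DataFrame whose index is deliberately not positional:

```python
import pandas as pd

wines = pd.DataFrame(
    {"title": ["A", "B", "C"], "points": [90, 85, 88], "price": [30.0, 5.0, 44.0]},
    index=[10, 20, 30],   # deliberately not 0, 1, 2
)

ratio = wines.points / wines.price
best_label = ratio.idxmax()                   # an index label, not a position
best_title = wines.loc[best_label, "title"]   # label-based lookup

print(best_label, best_title)   # 20 B
```

With this index, a positional lookup such as `wines.title.iloc[best_label]` would raise an `IndexError`, because there is no row at position 20.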

6. Create a Series `descriptor_counts` counting how many times the words 'tropical' or 'fruity' appear in the `description` column of the dataset:

In [31]:
tropical_count = reviews.description.map(lambda description: description.count('tropical')).sum()
fruity_count = reviews.description.map(lambda description: description.count('fruity')).sum()

descriptor_counts = pd.Series([tropical_count, fruity_count], index=['tropical', 'fruity'])

descriptor_counts

tropical    3703
fruity      9259
dtype: int64
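The solution above counts every occurrence of a word, so a description that mentions 'tropical' twice contributes 2. If the goal were instead the number of descriptions that mention the word at least once, a membership test would be the natural alternative. A sketch of both readings on hypothetical descriptions:

```python
import pandas as pd

# Hypothetical descriptions; the first mentions "tropical" twice.
descriptions = pd.Series([
    "tropical fruit and more tropical notes",
    "ripe and fruity",
    "dry and crisp",
])

# Total number of occurrences across all descriptions.
occurrences = descriptions.map(lambda d: d.count("tropical")).sum()

# Number of descriptions containing the word at least once.
rows_with_word = descriptions.map(lambda d: "tropical" in d).sum()

print(occurrences, rows_with_word)   # 2 1
```

Both are plain substring checks, so either would also match the word inside a longer word.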

7. Create a Series `star_ratings` with the number of stars corresponding to each review in the dataset.

- 95 points or more: 3 stars
- at least 85 but fewer than 95 points: 2 stars
- fewer than 85 points: 1 star

In addition, any wine from Canada automatically gets 3 stars:

In [40]:
def set_stars(row):
    if row.country == 'Canada' or row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(set_stars, axis='columns')

star_ratings

0         2
1         2
         ..
129969    2
129970    2
Length: 129971, dtype: int64
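Row-wise `apply()` calls a Python function once per row, which can be slow on a dataset of this size. As an alternative sketch, not the approach used above, the same rules can be expressed with boolean Series and NumPy's `np.select`, which picks the first matching condition for each row. The `toy` DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering each rule once.
toy = pd.DataFrame({
    "country": ["Canada", "US", "France", "Italy"],
    "points": [80, 96, 90, 82],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    (toy.country == "Canada") | (toy.points >= 95),   # 3 stars
    toy.points >= 85,                                 # 2 stars
]
stars = pd.Series(np.select(conditions, [3, 2], default=1), index=toy.index)

print(stars.tolist())   # [3, 3, 2, 1]
```

The order of `conditions` plays the same role as the order of the `if`/`elif` branches in `set_stars`.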