# Summary functions and maps reference 

## Introduction 

In [3]:
import pandas as pd
pd.set_option('display.max_rows', 5)
import numpy as np
reviews = pd.read_csv("C:/Users/teamo/PycharmProjects/Data-Analyse/Winemagz/winemag-data-130k-v2.csv",index_col=0)
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


## 1.Summary Functions

### 1)describe()

In [28]:
reviews.head().describe()

Unnamed: 0,points,price
count,5.0,4.00
mean,87.0,26.75
...,...,...
75%,87.0,27.50
max,87.0,65.00


- describe() generates a high-level summary of the numeric columns (here, of the five rows returned by head()).
- The output is type-aware: the summary above only makes sense for numerical data. For string data, here's what we get:

In [5]:
reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
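The difference between the two kinds of summary can be reproduced on a small scale. The sketch below uses a hypothetical four-row stand-in for the reviews data (the `toy` frame and its values are made up for illustration) and calls describe() on a numeric column and on a string column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine reviews data.
toy = pd.DataFrame({
    "points": [87, 87, 90, 92],
    "taster_name": ["Roger Voss", "Paul Gregutt", "Roger Voss", None],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy.points.describe()

# String column: count (non-missing), unique, top, freq.
text_summary = toy.taster_name.describe()

print(numeric_summary["mean"])  # 89.0
print(text_summary["top"])      # Roger Voss
```

Note that count ignores missing values, which is why the taster_name count in the real dataset is lower than the number of rows.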

- If you have narrower needs, there are functions that return more specific information.
- For example, to see the mean of the points allotted (that is, how well an averagely rated wine scores), we can use the mean function:

### 2)mean()

In [6]:
reviews.points.mean()

88.44713820775404

- To see a list of unique values we can use the unique function:

### 3)unique()

In [7]:
reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

- To see a list of unique values and how often they occur in the dataset, we can use the value_counts method:

### 4)value_counts()

In [12]:
reviews.taster_name.value_counts()

Roger Voss           25514
Michael Schachner    15134
                     ...  
Fiona Adams             27
Christina Pickard        6
Name: taster_name, Length: 19, dtype: int64
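The unique() output above lists nan among its 20 entries, while value_counts() reports a length of 19. A minimal sketch of why, on a made-up Series: value_counts() drops missing values by default, and passing dropna=False keeps them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing taster name.
tasters = pd.Series(["Roger Voss", "Paul Gregutt", "Roger Voss", np.nan])

distinct = tasters.unique()                           # includes the missing value
counts = tasters.value_counts()                       # missing values dropped by default
counts_with_nan = tasters.value_counts(dropna=False)  # missing values counted too

print(len(distinct), len(counts), len(counts_with_nan))  # 3 2 3
```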

## 2.Maps

Definition: A "map" is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values.

#### Function:
- In data science we often need to create new representations from existing data, or to transform data from the format it is in now to the format that we want it to be in later.
- Maps are what handle this work, making them extremely important for getting your work done!

- There are two mapping methods that you will use often.
    - Series.map is the first, and slightly simpler, one.
    - For example, suppose that we wanted to remean the scores the wines received to 0. We can do this as follows:

### 1)Series.map 

In [14]:
reviews_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - reviews_points_mean)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

- The function you pass to map should expect a single value from the Series (a point value, in the above example) and return a transformed version of that value.
- map returns a new Series where all the values have been transformed by your function.
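To make the behaviour easy to check by hand, here is the same remeaning idea as a minimal sketch on a made-up four-value Series (the values are hypothetical, chosen so the mean is a round number).

```python
import pandas as pd

# Hypothetical scores whose mean is exactly 89.0.
points = pd.Series([87, 87, 90, 92], name="points")
points_mean = points.mean()

# The lambda receives one value at a time and returns the transformed value.
centered = points.map(lambda p: p - points_mean)

print(centered.tolist())  # [-2.0, -2.0, 1.0, 3.0]
```

The centered values sum to zero, which is what remeaning to 0 means.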

### 2)DataFrame.apply

In [15]:
def remean_points(row):
    row.points = row.points - reviews_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,1.552862,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,1.552862,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


- DataFrame.apply is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

- If we had called reviews.apply with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.
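The two axis settings can be compared side by side. This is a sketch on a hypothetical two-column frame; the helper names `price_per_point` and `center` are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric-only frame.
toy = pd.DataFrame({"points": [87, 90, 93], "price": [15.0, 20.0, 40.0]})

# axis='columns': the function is called once per row and receives that row as a Series.
def price_per_point(row):
    return row.price / row.points

per_row = toy.apply(price_per_point, axis="columns")

# axis='index': the function is called once per column and receives that column as a Series.
def center(column):
    return column - column.mean()

per_column = toy.apply(center, axis="index")

print(per_column["points"].tolist())  # [-3.0, 0.0, 3.0]
print(per_column["price"].tolist())   # [-10.0, -5.0, 15.0]
```

With axis='columns' the result here is a Series with one value per row; with axis='index' each column is transformed and the result keeps the shape of the original frame.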

- Note that Series.map and DataFrame.apply return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on.

- If we look at the first row of reviews, we can see that it still has its original points value.

In [17]:
reviews.head(1)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
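Because the result is a new object, it has to be assigned somewhere if we want to keep it. A minimal sketch on a hypothetical frame, where the column name `points_centered` is an illustrative choice:

```python
import pandas as pd

# Hypothetical scores with a mean of 90.
toy = pd.DataFrame({"points": [87, 90, 93]})

# map returns a new Series; toy.points itself is left as it was.
centered = toy.points.map(lambda p: p - 90)

# Store the result explicitly to keep it alongside the original column.
toy["points_centered"] = centered

print(toy.points.tolist())           # [87, 90, 93]
print(toy.points_centered.tolist())  # [-3, 0, 3]
```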


## pandas provides many common mapping operations as built-ins.

- For example, here's a faster way of remeaning our points column:



In [20]:
reviews_points_mean = reviews.points.mean()
reviews.points - reviews_points_mean

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

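- In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). pandas looks at this expression and works out that we mean to subtract that single value from every value in the Series.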
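As a quick check that the operator form and the map form agree, here is a sketch on a made-up Series (the values are hypothetical).

```python
import pandas as pd

# Hypothetical scores whose mean is 89.0.
points = pd.Series([87, 87, 90, 92])
points_mean = points.mean()

# The single mean value is applied to every element of the Series.
via_operator = points - points_mean

# Element-by-element equivalent using map.
via_map = points.map(lambda p: p - points_mean)

print(via_operator.tolist())         # [-2.0, -2.0, 1.0, 3.0]
print(via_operator.equals(via_map))  # True
```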

- pandas will also understand what to do if we perform these operations between Series of equal length.
- For example, an easy way of combining country and region information in the dataset would be to do the following:

In [24]:
reviews.country + "-" + reviews.region_1

0            Italy-Etna
1                   NaN
              ...      
129969    France-Alsace
129970    France-Alsace
Length: 129971, dtype: object
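Row 1 in the output above is NaN because that wine has no region_1 value, and a missing value on either side makes the combined result missing. The sketch below reproduces this on two made-up rows and shows one possible workaround with fillna; the placeholder text "Unknown" is an assumption for illustration, not something the dataset uses.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: the second row has no region.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "region_1": ["Etna", np.nan],
})

# A missing region makes the whole combined value missing.
combined = toy.country + "-" + toy.region_1

# Filling the gaps first keeps every row populated.
filled = toy.country + "-" + toy.region_1.fillna("Unknown")

print(combined.isna().tolist())  # [False, True]
print(filled.tolist())           # ['Italy-Etna', 'Portugal-Unknown']
```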

These operators are faster than map or apply because they use speed-ups built into pandas. All of the standard Python operators (>, <, ==, and so on) work in this manner.

However, they are not as flexible as map or apply, which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
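As an illustration of conditional logic that arithmetic alone cannot express, the sketch below maps scores to a star rating. The `stars` function and its thresholds (95 and 85) are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical scores, one per rating band.
points = pd.Series([82, 88, 95])

def stars(p):
    """Convert a score to a 1-3 star rating using made-up thresholds."""
    if p >= 95:
        return 3
    if p >= 85:
        return 2
    return 1

# map calls stars once per value and collects the results in a new Series.
ratings = points.map(stars)

print(ratings.tolist())  # [1, 2, 3]
```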