**This notebook is an exercise in the [Pandas](https://www.kaggle.com/learn/pandas) course.  You can reference the tutorial at [this link](https://www.kaggle.com/residentmario/summary-functions-and-maps).**

---


# Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

In [None]:
import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

# Exercises

## 1.

What is the median of the `points` column in the `reviews` DataFrame?

In [None]:
median_points = reviews.points.median()

# Check your answer
q1.check()

In [None]:
#q1.hint()
#q1.solution()

## 2. 
What countries are represented in the dataset? (Your answer should not include any duplicates.)

In [None]:
countries = pd.Series(reviews['country'].unique()).sort_values()

# Check your answer
q2.check()

print('Countries:\n', *enumerate(countries, 1), sep='\n')

In [None]:
q2.hint()
q2.solution()
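
As a minimal sketch of how `unique()` treats order and missing values, the toy Series below stands in for `reviews['country']`. Its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for reviews['country']; the values are invented for illustration.
country = pd.Series(["Italy", "France", None, "Italy", "US", "France"])

# unique() keeps the order of first appearance and also keeps the missing value.
print(country.unique())

# Dropping missing values first gives a list of labels that sorts cleanly.
countries_sorted = sorted(country.dropna().unique())
print(countries_sorted)
```

Dropping missing values before sorting is a design choice here: it avoids mixing strings with a missing-value marker in the same sort.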

## 3.
How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [None]:
pd.set_option('display.max_rows', 20)

In [None]:
reviews_per_country = reviews['country'].value_counts()

# Check your answer
q3.check()

# value_counts() already sorts in descending order, so head(20) gives the top 20.
reviews_per_country.head(20)

In [None]:
q3.hint()
q3.solution()
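
The behaviour of `value_counts()` can be sketched on a small made-up Series (not the real data). By default it drops missing values and sorts counts in descending order; `dropna=False` and `normalize=True` are optional parameters that change that.

```python
import pandas as pd

# Toy stand-in for reviews['country']; the values are invented for illustration.
country = pd.Series(["US", "Italy", "US", None, "France", "US", "Italy"])

# Default: missing values are dropped and counts are sorted descending.
counts = country.value_counts()
print(counts)

# Keep the missing value as its own category.
counts_with_na = country.value_counts(dropna=False)
print(counts_with_na)

# Relative frequencies instead of raw counts (missing values still excluded).
shares = country.value_counts(normalize=True)
print(shares)
```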

## 4.
Create variable `centered_price` containing a version of the `price` column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.) 

In [None]:
centered_price = reviews['price'] - reviews['price'].mean()

# Check your answer
q4.check()

print('centered_price:\n',centered_price, sep='')
print('\nmean of centered_price:',round(centered_price.mean()))

In [None]:
q4.hint()
q4.solution()
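
A minimal sketch of the centering step on made-up prices, including one missing value. `mean()` skips missing values by default, so the missing price stays missing after subtraction and the centered values average to zero.

```python
import pandas as pd

# Toy prices, invented for illustration; one value is missing.
price = pd.Series([10.0, 20.0, None, 30.0])

# mean() ignores the missing value, so the mean is taken over 10, 20 and 30.
mean_price = price.mean()

# Subtracting a scalar broadcasts over every element of the Series.
centered = price - mean_price

print(centered)
print("mean of centered:", centered.mean())
```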

## 5.
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

In [None]:
print('Highest points-to-price:',(reviews['points']/reviews['price']).max(skipna=True))

In [None]:
# Compute the ratio once, then keep every row that ties for the maximum.
points_per_price = reviews['points'] / reviews['price']
bargain_wines = reviews.loc[points_per_price == points_per_price.max()]
bargain_wines

In [None]:
q5.check()

In [None]:
q5.hint()

In [None]:
q5.solution()

In [None]:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
bargain_wine
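
The two approaches used in this exercise differ when several wines tie for the best ratio. The sketch below uses a made-up frame with hypothetical titles: `idxmax()` returns the label of the first maximum only, while a boolean mask against `max()` keeps every tied row.

```python
import pandas as pd

# Toy data, invented for illustration; wines "A" and "C" tie, "D" has no price.
wines = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "points": [86, 90, 86, 88],
    "price": [4.0, 20.0, 4.0, None],
})

ratio = wines["points"] / wines["price"]

# idxmax() skips the missing ratio and returns the first label with the maximum.
first_best = wines.loc[ratio.idxmax(), "title"]

# A boolean mask keeps all rows whose ratio equals the maximum.
all_best = wines.loc[ratio == ratio.max(), "title"].tolist()

print(first_best)
print(all_best)
```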

## 6.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset.

In [None]:
# Normalize each word to lowercase and strip trailing punctuation (first description only).
first_description = reviews['description'].iloc[0]
parsing_test = [word.lower().rstrip(',.?:;') for word in first_description.split()]
parsing_test

In [None]:
# Whole-word membership test on the first description only.
in_test = 'tropical' in [word.lower().rstrip(',.?:;') for word in reviews['description'].iloc[0].split()]
in_test

In [None]:
tropic_counts = sum(map(lambda desc: 'tropical' in [word.lower().rstrip(',.?:;') for word in desc.split()], reviews['description']))
fruity_counts = sum(map(lambda desc: 'fruity'   in [word.lower().rstrip(',.?:;') for word in desc.split()], reviews['description']))

descriptor_counts = pd.Series( [tropic_counts, fruity_counts],
                              index = ['tropical', 'fruity'],
                              name = 'Descriptor counts')
# Check your answer
q6.check()

descriptor_counts

In [None]:
q6.hint()
#q6.solution()

In [None]:
q6.solution()

In [None]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

# Check your answer
q6.check()

descriptor_counts

In [None]:
test_description = pd.Series( ["tropical" * 5] * 3 )
n_trop = test_description.map(lambda desc: "tropical" in desc).sum()
n_trop
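
The cells above count in two different ways: whole-word membership after normalization, and a plain substring test with `in`. These can give different totals. The sketch below uses made-up descriptions to show where they diverge, and uses `str.contains` with a `\b` word-boundary pattern as one possible whole-word test. Either way, a description counts at most once, however many times the word appears in it.

```python
import pandas as pd

# Toy descriptions, invented for illustration.
descriptions = pd.Series([
    "Tropical fruit aromas, with a fruity finish.",
    "Subtropical notes of mango.",
    "Ripe and fruity; very fruity indeed.",
    "Dry and mineral.",
])

# Case-sensitive substring test: misses "Tropical", matches inside "Subtropical".
substring_tropical = descriptions.map(lambda desc: "tropical" in desc)

# Case-insensitive whole-word test using word boundaries.
word_tropical = descriptions.str.contains(r"\btropical\b", case=False, regex=True)

# Number of descriptions mentioning "fruity" versus total mentions of the word.
fruity_descriptions = descriptions.map(lambda desc: "fruity" in desc).sum()
fruity_mentions = descriptions.str.lower().str.count(r"\bfruity\b").sum()

print(substring_tropical.tolist())
print(word_tropical.tolist())
print(fruity_descriptions, fruity_mentions)
```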

## 7.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand, so we'd like to translate the scores into simple star ratings. A score of 95 or higher counts as 3 stars, and a score of at least 85 but less than 95 counts as 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [None]:
# 3 stars: >= 95  |  From Canada
# 2 stars: >= 85
# 1 star

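star_ratings = []

# Iterating over the two columns with zip avoids slow label lookups such as reviews['points'][i].
for points, country in zip(reviews['points'], reviews['country']):
    if points >= 95 or country == 'Canada':
        star_ratings.append(3)
    elif points >= 85:
        star_ratings.append(2)
    else:
        star_ratings.append(1)

# The exercise asks for a Series, so wrap the list and reuse the reviews index.
star_ratings = pd.Series(star_ratings, index=reviews.index)
star_ratings.value_counts()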

In [None]:
q7.hint()
#q7.solution()

In [None]:
def stars(row):
    # Use a local variable; assigning to row.stars would set an attribute on the row Series.
    if (row.points >= 95) or (row.country == 'Canada'): n_stars = 3
    elif row.points >= 85:                              n_stars = 2
    else:                                               n_stars = 1
    return n_stars
        
star_ratings = reviews.apply(stars, axis='columns')
                                                 
# Check your answer
q7.check()

pd.Series(star_ratings).value_counts()

In [None]:
q7.solution()

In [None]:
def stars(row):
    if  (row.points >= 95) | (row.country == 'Canada'): return 3
    elif row.points >= 85:                              return 2
    else:                                               return 1
        
star_ratings = reviews.apply(stars, axis='columns')
                                                 
# Check your answer
q7.check()

pd.Series(star_ratings).value_counts()
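
As an alternative to a row-wise `apply`, the same rules can be sketched in vectorized form with `numpy.select`, which picks the first matching condition for each row. The frame below is made up for illustration, and the names `toy` and `star_ratings_vectorized` are hypothetical. This is not the course's reference solution.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France", "Canada"],
    "points": [82, 96, 85, 84, 97],
})

# Conditions are checked in order; the first True one decides the rating.
conditions = [
    (toy["points"] >= 95) | (toy["country"] == "Canada"),  # 3 stars
    toy["points"] >= 85,                                    # 2 stars
]
star_ratings_vectorized = pd.Series(
    np.select(conditions, [3, 2], default=1), index=toy.index
)

# Row-wise version with the same rules, for comparison.
def stars(row):
    if row.points >= 95 or row.country == "Canada":
        return 3
    elif row.points >= 85:
        return 2
    return 1

star_ratings_applied = toy.apply(stars, axis="columns")

print(star_ratings_vectorized.tolist())
print((star_ratings_vectorized == star_ratings_applied).all())
```

On a frame the size of `reviews`, the vectorized form typically runs much faster than `apply`, since it avoids calling a Python function once per row.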

# Keep going
Continue to **[grouping and sorting](https://www.kaggle.com/residentmario/grouping-and-sorting)**.

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161299) to chat with other Learners.*