# DATA 1 Practical 3 - Questions

Simos Gerasimou


## Wine Exploration

**WineEnthusiast** is a website where customers can buy wine products and review them. The company collected reviews for a wide variety of its products on November 22nd, 2017, and wants to analyse this data to extract insights about its products and answer questions such as:
* How are its products rated by customers?
* Are there patterns that might increase its revenue and/or profit?

#### Your task is to explore this dataset and generate actionable knowledge.


This Jupyter Notebook will be presented to the WineEnthusiast main stakeholders who have limited knowledge about data science. Your findings should be complemented by a suitable justification explaining what you observe and, when applicable, what this observation means and, possibly, why it occurs.


***

### **Important Information**

(1) To answer these exercises, you **must first read Chapter 2: Introduction to NumPy from the Python Data Science Handbook** (https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)


(2) For each question (task) a description is provided accompanied (most of the time) by two cells: one for writing the Python code and another for providing the justification. Feel free to add more cells if you feel they are needed, but keep the cells corresponding to the same question close by.

**Hint**: If you have difficulty solving a task, look at Chapter 2 of the Python Data Science Handbook.


#### **T1) Explore the dataset and, for each column, write its name, data type (categorical/numerical: nominal, ordinal, discrete, continuous) and its meaning (i.e., what does it capture?)**

* You may want to open the CSV file using a text editor (e.g., Notepad) or a spreadsheet editor (e.g., Excel)

**Write your answer here**


### 1) Reading dataset

The wine dataset is available on the VLE (look for "wine-data-filtered-500.csv" in the Practicals section)

In [3]:
#Using NumPy to read the dataset
import numpy as np
#Define the path to the dataset
data_path = "wine-data-filtered-500.csv"
#Define the type of each dataset column. 
#This is needed because a regular NumPy array holds a single data type,
#while the columns of this file have different types (integers and strings).
#Hence, we are using Structured arrays. 
#But, we will soon move to Pandas which makes data manipulation easier
types = ['i4', 'U30', 'i4', 'i4', 'U50', 'U50', 'U100', 'U100', 'U100']
#Read the dataset
data = np.genfromtxt(data_path, dtype=types, delimiter=',', names=True)
print(data.dtype)

[('ID', '<i4'), ('country', '<U30'), ('points', '<i4'), ('price', '<i4'), ('province', '<U50'), ('tasterName', '<U50'), ('title', '<U100'), ('variety', '<U100'), ('winery', '<U100')]
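To try the same reading step without the VLE file, the sketch below feeds `np.genfromtxt` a few made-up rows through `io.StringIO`. The rows, the reduced column set and the names `csv_text`, `toy_types` and `toy` are assumptions for illustration only; they are not taken from the real dataset.

```python
import io
import numpy as np

# Made-up rows for illustration only; NOT taken from the real dataset.
csv_text = """ID,country,points,price,winery
0,Italy,87,15,Example Winery A
1,US,91,40,Example Winery B
2,France,95,120,Example Winery C
"""

# One type per column, in column order, as in the cell above.
toy_types = ['i4', 'U30', 'i4', 'i4', 'U100']

# names=True takes the field names from the header line.
toy = np.genfromtxt(io.StringIO(csv_text), dtype=toy_types,
                    delimiter=',', names=True, encoding='utf-8')

print(toy.dtype.names)       # field names read from the header
print(toy['country'])        # one column, selected by name
print(toy['price'].mean())   # numeric columns support the usual NumPy operations
```

Each field keeps its own type, so `toy['price']` behaves like an ordinary integer array while `toy['country']` is an array of strings.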


##### **Since we are using Structured Arrays, we can extract the entries of a column by specifying its name. We can further slice the array by using the standard [Python slicing mechanism](https://www.w3schools.com/python/numpy_array_slicing.asp)**



In [None]:
#Print the first 5 entries (rows)
print(data[0:5])

In [None]:
#Print the first ten wine titles
print(data['title'][0:10])

***
### **What do the wine prices look like?**


#### **T2) Calculate the mean and median prices for all the wines**

In [8]:
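#Write your answer here
mean = np.sum(data['price']) / np.size(data['price'])
#Note: for an even number of prices this picks the upper of the two middle values;
#np.median(data['price']) would average the two middle values instead
median = np.sort(data['price'])[np.size(data['price'])//2]
mean, median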


(42.428, 30)
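The difference between indexing the sorted array and the usual definition of the median only shows up when the number of values is even. A minimal sketch on made-up prices (the array `prices` is an assumption, not data from the file):

```python
import numpy as np

# Made-up prices, chosen so that the two middle values differ.
prices = np.array([10, 40, 20, 30])

n = np.size(prices)
upper_middle = np.sort(prices)[n // 2]   # element at index 2 of [10, 20, 30, 40]
true_median = np.median(prices)          # average of the two middle values

print(upper_middle)   # 30
print(true_median)    # 25.0
print(np.mean(prices) == np.sum(prices) / n)   # True
```

With an odd number of values both approaches return the same element.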

#### **T3) Calculate the min, max, range and standard deviation of wine prices**

In [9]:
#Write your answer here
minimum = np.min(data['price'])
maximum = np.max(data['price'])
range_values = maximum - minimum
standard_dev = np.std(data['price'])
minimum, maximum, range_values, standard_dev


(7, 775, 768, 60.51959034891099)

#### **T4) What insights can you extract from these values? Which metric of central tendency should we use?**

**Write your answer here**
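Prices span a very wide range (7 to 775), and the mean (42.43) is well above the median (30). This suggests a right-skewed distribution in which a few very expensive wines pull the mean up. The standard deviation (about 60.5) is a measure of dispersion, not of central tendency, so it cannot answer the second question. The median is the better measure of central tendency here because it is robust to those extreme prices.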
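The effect of a single very expensive wine on the mean and on the median can be checked on a small made-up sample. The prices below are assumptions chosen for illustration, apart from 775, which is the maximum reported above.

```python
import numpy as np

# Made-up prices for "typical" wines.
typical = np.array([10, 12, 15, 18, 20])
# The same sample with one very expensive wine added.
with_outlier = np.append(typical, 775)

print(np.mean(typical), np.median(typical))             # 15.0 15.0
print(np.mean(with_outlier), np.median(with_outlier))   # mean jumps, median barely moves
```

The mean rises from 15 to roughly 141.7, while the median only moves from 15 to 16.5.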


***
### **What do the reviewers think about the quality of wines?**

#### **T5) Calculate the metrics of central tendency for wine ratings (points)**

In [16]:
#Write your answer here 
import pandas as pd

df = pd.read_csv('wine-data-filtered-500.csv')

mean = df["points"].mean()
median = df["points"].median()
mode = df["points"].mode()
mean, median, mode


(89.244,
 89.0,
 0    87
 dtype: int64)
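pandas returns the mode as a Series because several values can tie for most frequent. The same idea can be sketched in plain NumPy with `np.unique`; the ratings below are made up so that a tie occurs.

```python
import numpy as np

# Made-up ratings in which 87 and 90 both appear twice.
points = np.array([87, 90, 87, 92, 90, 88])

# Distinct values (sorted) and how often each one occurs.
values, counts = np.unique(points, return_counts=True)

# Keep every value whose count equals the highest count.
modes = values[counts == counts.max()]
print(modes)   # [87 90]
```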

#### **T6) Calculate the metrics of dispersion for wine ratings (points)**

In [18]:
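#Write your answer here 
#np.std and np.var use ddof=0 by default (population standard deviation and variance)
standard_dev = np.std(data["points"])
#Interquartile range: distance between the 75th and 25th percentiles
q25,q75 = np.percentile(data["points"], [25, 75])
iqr = q75 - q25
variance = np.var(data["points"])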


#### **T7) Calculate the interquartile range for the ratings of all reviewed wines**

In [19]:
#Write your answer here
q25,q75 = np.percentile(data["points"], [25, 75])
iqr = q75 - q25
iqr


4.0
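The dispersion metrics used in T6 and T7 can be compared side by side on a small made-up sample. The ratings below are assumptions; the `ddof` argument shows the difference between the population and the sample standard deviation.

```python
import numpy as np

# Made-up ratings for illustration.
points = np.array([85, 87, 87, 89, 90, 92, 94])

pop_std = np.std(points)              # ddof=0: divides by n
sample_std = np.std(points, ddof=1)   # ddof=1: divides by n - 1
variance = np.var(points)             # population variance, equals pop_std ** 2

# Quartiles with NumPy's default linear interpolation.
q25, q75 = np.percentile(points, [25, 75])
iqr = q75 - q25

print(pop_std, sample_std, variance)
print(q25, q75, iqr)   # 87.0 91.0 4.0
```

The sample standard deviation is always slightly larger than the population one; with 500 reviews the difference is small.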

#### **T8) What insights can you extract from these values? Which metric of central tendency should we use?**

**Write your answer here**


### **Further Analysis**

#### **T9) How many wine varieties have been reviewed?**

In [20]:
#Write your answer here
len(np.unique(data["variety"]))


91

#### **T10) Which is the most reviewed wine variety and what is its mean rating?**

* Hint: Check the section on array masking from the NumPy chapter in the Python Data Science Handbook

In [49]:
#Write your answer here
#Most frequent variety (pandas mode), then the row indices where it occurs
mode = df["variety"].mode()
np.argwhere(data["variety"]==mode[0])



array([[  3],
       [ 19],
       [ 23],
       [ 28],
       [ 30],
       [ 37],
       [ 57],
       [102],
       [109],
       [124],
       [128],
       [167],
       [172],
       [173],
       [175],
       [180],
       [190],
       [193],
       [204],
       [213],
       [220],
       [234],
       [241],
       [257],
       [263],
       [306],
       [312],
       [316],
       [317],
       [346],
       [347],
       [348],
       [349],
       [353],
       [354],
       [357],
       [360],
       [363],
       [364],
       [369],
       [370],
       [371],
       [372],
       [379],
       [388],
       [392],
       [394],
       [414],
       [443],
       [444],
       [459],
       [470],
       [487],
       [489],
       [499]], dtype=int64)
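The indices above locate the reviews of the most reviewed variety, but the task also asks for its mean rating. One possible way to finish, sketched on a made-up structured array (the rows and the name `toy` are assumptions), is to build a Boolean mask and average the masked `points`:

```python
import numpy as np

# Made-up reviews for illustration only.
toy = np.array(
    [("Pinot Noir", 90), ("Riesling", 88), ("Pinot Noir", 94),
     ("Chardonnay", 86), ("Pinot Noir", 89)],
    dtype=[("variety", "U100"), ("points", "i4")],
)

# Count the reviews per variety and pick the most frequent one.
varieties, counts = np.unique(toy["variety"], return_counts=True)
top_variety = varieties[np.argmax(counts)]

# Boolean mask: True for the rows of that variety.
mask = toy["variety"] == top_variety

print(top_variety, mask.sum(), toy["points"][mask].mean())
```

On the real data the same three steps would use `data["variety"]` and `data["points"]`.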

#### **T11) Which are the most widely reviewed wineries? How many reviews did each receive?**

* Hint: Check the section on array masking from the NumPy chapter in the Python Data Science Handbook
* Hint: Another option is to use the argwhere function from NumPy (https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html)
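A third option, not mentioned in the hints, is to count the reviews per winery with `np.unique` and sort the counts. The sketch below uses hypothetical winery names; sorting the negated counts with a stable sort is just one way to get a descending order with predictable tie handling.

```python
import numpy as np

# Hypothetical winery names, one entry per review.
wineries = np.array(["A", "B", "A", "C", "B", "A", "D"])

# Distinct wineries and the number of reviews for each.
names, counts = np.unique(wineries, return_counts=True)

# Indices that sort the counts from most to fewest reviews.
order = np.argsort(-counts, kind="stable")

for name, count in zip(names[order][:3], counts[order][:3]):
    print(name, count)
```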

In [None]:
#Write your answer here


#### **T12) Which reviewed wines are white?**

* Hint: Which variable of a wine may contain this information?
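If the grape variety is the variable you choose, one possible approach is to test membership in a list of white varieties with `np.isin`. The list below is a short, hand-picked assumption and is far from complete, and the wine titles are made up.

```python
import numpy as np

# Made-up wines for illustration only.
toy = np.array(
    [("Toy Wine 1", "Chardonnay"), ("Toy Wine 2", "Pinot Noir"),
     ("Toy Wine 3", "Riesling"), ("Toy Wine 4", "Merlot")],
    dtype=[("title", "U100"), ("variety", "U100")],
)

# Assumption: a hand-picked, incomplete list of white grape varieties.
white_varieties = ["Chardonnay", "Riesling", "Sauvignon Blanc"]

# True where the wine's variety appears in the list.
is_white = np.isin(toy["variety"], white_varieties)
print(toy["title"][is_white])   # ['Toy Wine 1' 'Toy Wine 3']
```

The quality of the answer depends entirely on how complete the list is, which is worth stating in the justification.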

In [None]:
#Write your answer here


#### **T13) How many tasters (sommeliers) have reviewed wines produced by the "Winzer Krems" winery?**

In [None]:
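#Write your answer here
#Mask the rows for the winery, then count the distinct taster names
krems_tasters = data['tasterName'][data['winery'] == "Winzer Krems"]
len(np.unique(krems_tasters))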


#### **T14) What can you infer about the ratings given by the sommeliers for wines produced by "Le Cadeau"? How much confidence would you have in these reviews?**

In [None]:
#Write your answer here


#### **T15) Which country's wines have received the most reviews with a rating above 95? How much do these wines cost on average?**

In [12]:
#Write your answer here

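#Indices of the wines rated above 95
IndexCtr = np.where(data['points'] > 95)
#Count how many of those wines come from each country
countries, Ctrcount = np.unique(data['country'][IndexCtr], return_counts=True)
modeIndexCtr = np.argmax(Ctrcount)
modeCtr = countries[modeIndexCtr]
#Number of reviews above 95 for that country (taken from the counts, not the names)
modeCountCtr = Ctrcount[modeIndexCtr]

print(f"The country: {modeCtr}")

#Wines from that country rated above 95
m = np.where((data['country'] == modeCtr) & (data['points'] >95))

print(f"The average price: {np.mean(data['price'][m])}")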



The country: Australia
The average price: 256.25


#### **T16) What is the name (title) of the wine with the highest score? Are there other wines that cost as much as the wine with the highest score? If so, give their names (titles).**

In [14]:
#Write your answer here
index = np.argmax(data['points'])
print(data['title'][index])


Chambers Rosewood Vineyards NV Rare Muscat (Rutherglen)
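`np.argmax` returns only the first index of the maximum, so ties for the highest score are not visible in the output above, and the second part of the question is still open. A minimal sketch of both steps on made-up wines (titles, scores and prices are assumptions):

```python
import numpy as np

# Made-up wines for illustration only.
toy = np.array(
    [("Toy Wine 1", 97, 300), ("Toy Wine 2", 100, 300),
     ("Toy Wine 3", 92, 45), ("Toy Wine 4", 100, 80)],
    dtype=[("title", "U100"), ("points", "i4"), ("price", "i4")],
)

# Every wine that shares the highest score, not just the first one.
top = toy["points"] == toy["points"].max()
print(toy["title"][top])

# Wines that are not top-scored but cost the same as a top-scored wine.
same_price = np.isin(toy["price"], toy["price"][top]) & ~top
print(toy["title"][same_price])
```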


#### **T17) How many wines from Italy have a rating above the 90th percentile, and which provinces do these wines come from?**

In [15]:
#Write your answer here
#The task asks for the 90th percentile of all ratings
percentile = np.percentile(data['points'], 90)


AttributeError: 'numpy.ndarray' object has no attribute 'lower'
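An error like the `AttributeError` above typically appears when a Python string method such as `.lower()` is called directly on a NumPy array. NumPy offers element-wise equivalents in `np.char`. The sketch below, on made-up rows, combines `np.char.lower` with a percentile mask; the countries, ratings and provinces are assumptions for illustration.

```python
import numpy as np

# Made-up reviews for illustration only.
toy = np.array(
    [("Italy", 98, "Tuscany"), ("italy", 88, "Piedmont"), ("US", 96, "California"),
     ("Italy", 93, "Veneto"), ("France", 85, "Alsace")],
    dtype=[("country", "U30"), ("points", "i4"), ("province", "U50")],
)

# 90th percentile of all ratings, not only the Italian ones.
p90 = np.percentile(toy["points"], 90)

# Element-wise lower-casing, so "Italy" and "italy" both match.
is_italy = np.char.lower(toy["country"]) == "italy"

# Italian wines rated above the 90th percentile.
mask = is_italy & (toy["points"] > p90)

print(mask.sum())                        # number of matching wines
print(np.unique(toy["province"][mask]))  # provinces they come from
```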

#### **T18) What is the average rating given by each sommelier?**

In [None]:
#Write your answer here
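One possible NumPy-only approach to a per-group average is to map each taster to an integer code with `np.unique(..., return_inverse=True)` and then sum and count per code with `np.bincount`. The taster names and ratings below are hypothetical.

```python
import numpy as np

# Hypothetical tasters and the rating each review received.
tasters = np.array(["Ann", "Bob", "Ann", "Cy", "Bob", "Ann"])
points = np.array([90, 85, 94, 88, 87, 92])

# names: distinct tasters; inverse: index into names for every review.
names, inverse = np.unique(tasters, return_inverse=True)

sums = np.bincount(inverse, weights=points)   # total points per taster
counts = np.bincount(inverse)                 # number of reviews per taster
means = sums / counts

for name, mean, count in zip(names, means, counts):
    print(name, mean, count)
```

A loop over `np.unique(tasters)` with a Boolean mask per taster gives the same result and may be easier to read.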


#### **T19) Who is the sommelier with the highest average rating and how many reviews has he/she written?**

In [None]:
#Write your answer here


#### **T20) Which US province has received the highest number of wine reviews?**

In [None]:
#Write your answer here


#### **T21) Who are the sommeliers with no rating above 90?**

* Hint: You may want to look at https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html#Counting-entries
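Besides counting entries per taster, a set-based approach is possible: collect the tasters who gave at least one rating above 90 and remove them from the full list with `np.setdiff1d`. The names and ratings below are hypothetical.

```python
import numpy as np

# Hypothetical tasters and ratings, one entry per review.
tasters = np.array(["Ann", "Bob", "Ann", "Cy", "Bob", "Cy"])
points = np.array([91, 85, 88, 90, 89, 87])

# Tasters with at least one rating strictly above 90.
high_raters = np.unique(tasters[points > 90])

# Everyone else never rated a wine above 90.
never_high = np.setdiff1d(np.unique(tasters), high_raters)
print(never_high)   # ['Bob' 'Cy']
```

Note that a rating of exactly 90 does not count as "above 90" here.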

In [None]:
#Write your answer here


### Ideas for practicing further at home

* Find the tasters (sommeliers) who provided the most reviews and the highest
* Find the winery that received the highest number of independent reviews
* Find the average rating of each winery, and the wineries with the highest and lowest average ratings