# Subsetting and Descriptive Stats

## Before you start:
   - Remember that you just need to do one of the challenges.
   - Keep in mind that you need to use some of the functions you learned in the previous lessons.
   - All datasets are provided in IronHack's database.
   - Elaborate on your code and outputs as much as you can.
   - Try your best to answer the questions and complete the tasks and most importantly: enjoy the process!
   
#### Import all the necessary libraries here:

In [None]:
# Imports
import pandas as pd
import numpy as np

# [ONLY ONE MANDATORY] Challenge 1
#### In this challenge we will use the `Temp_States` dataset.

#### First import it into a dataframe called `temp`.

In [None]:
temp = pd.read_csv('Temp_States.csv')

#### Print `temp`.

In [None]:
display(temp)

#### Explore the data types of the *temp* dataframe. What types of data do we have? Comment your result.

In [None]:
temp.dtypes

In [None]:
# ANSWER:
# 'City' and 'State' have dtype 'object'.
# When importing data, pandas cannot guarantee that a text column holds only strings,
# so by default it falls back to the generic 'object' dtype for such columns.
# 'Temperature' has dtype float64 (64-bit floating point).
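The dtype behaviour described in the answer above can be checked on a small frame. This is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from `Temp_States`). Depending on the pandas version, text columns may be reported as `object` or as a dedicated string dtype, so the sketch uses `select_dtypes` instead of hard-coding dtype names.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'City': ['A', 'B'],
    'Temperature': [12.5, 18.0],
})

# Inspect the dtypes pandas inferred for each column.
print(toy.dtypes)

# Split the columns into numeric and non-numeric without naming exact dtypes.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
text_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols, text_cols)
```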

#### Select the rows where state is New York.

In [None]:
nyc_temp = temp[temp['State'] == 'New York']

nyc_temp

#### What is the average temperature of cities in New York?

In [None]:
np.mean(nyc_temp['Temperature'])

#### Which states and cities have a temperature above 15 degrees Celsius?

In [None]:
temp_above_15 = temp.loc[temp['Temperature'] > 15, ['City', 'State']]

temp_above_15

#### Now, return only the cities that have a temperature above 15 degrees Celsius.

In [None]:
temp_above_15['City']

#### Which cities have a temperature above 15 degrees Celsius and below 20 degrees Celsius?

**Hint**: First, write the condition. Then, select the rows.

In [None]:
temp.loc[(temp['Temperature'] > 15) & (temp['Temperature'] < 20), ['City', 'Temperature']]
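The two steps from the hint (write the condition, then select the rows) can be sketched separately on a toy frame. The data and the names `toy`, `mask` and `selected` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data; not the real Temp_States values.
toy = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D'],
    'Temperature': [14.0, 16.0, 19.5, 21.0],
})

# Step 1: write the condition. Each comparison needs its own parentheses
# because & binds more tightly than > and <.
mask = (toy['Temperature'] > 15) & (toy['Temperature'] < 20)

# Step 2: select the rows with the mask and the columns by name in one .loc call.
selected = toy.loc[mask, ['City', 'Temperature']]
print(selected)
```

Keeping the condition in its own variable makes it easy to inspect (for example with `mask.sum()`) before using it to filter.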

#### Find the mean and standard deviation of the temperature of each state.

In [None]:
temp.groupby('State')['Temperature'].agg(['mean', 'std']).reset_index()
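When reading the per-state standard deviation, it may help to know that pandas computes the sample standard deviation (`ddof=1`), so a state with a single city yields `NaN`. The sketch below uses hypothetical toy data to show this, and contrasts it with `np.std` on a plain array, which defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: state 'X' has two cities, state 'Y' only one.
toy = pd.DataFrame({
    'State': ['X', 'X', 'Y'],
    'Temperature': [10.0, 20.0, 15.0],
})

stats = toy.groupby('State')['Temperature'].agg(['mean', 'std'])
print(stats)

# pandas uses the sample standard deviation (ddof=1) ...
sample_std = toy.loc[toy['State'] == 'X', 'Temperature'].std()
# ... while np.std on a plain array defaults to the population form (ddof=0).
population_std = np.std(np.array([10.0, 20.0]))
print(sample_std, population_std)
```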

# [ONLY ONE MANDATORY] Challenge 2

#### Load the `employees` dataset into a dataframe. Call the dataframe `employees`.

In [None]:
employees = pd.read_csv('Employee.csv')

employees.head()

#### Explore the data types of the `employees` dataframe. Comment your results.

In [None]:
employees.dtypes

In [None]:
# ANSWER:
# 'Name', 'Department', 'Education', 'Gender' and 'Title' are text columns (dtype 'object' by default).
# 'Years' and 'Salary' have dtype int64 (64-bit integers).

#### What's the average salary in this company?

In [None]:
np.mean(employees['Salary'])

#### What's the highest salary?

In [None]:
employees['Salary'].max()

#### What's the lowest salary?

In [None]:
salary_min = employees['Salary'].min()

salary_min

#### Who are the employees with the lowest salary?

In [None]:
employees.loc[employees['Salary'] == salary_min, ['Name', 'Salary']]

#### Find all the information about an employee called David.

In [None]:
david = employees[employees['Name'] == 'David']

david

#### Could you return only David's salary?

In [None]:
david['Salary'].values[0]

#### Print all the rows where job title is associate.

In [None]:
employees[employees['Title'] == 'associate']

#### Print the first 3 rows of your dataframe.
**Tip**: There are 2 ways to do it. Do it both ways.

In [None]:
employees.head(3)

In [None]:
employees.iloc[:3]

#### Find the employees whose title is associate and whose salary is above 55.

In [None]:
employees[(employees['Title'] == 'associate') & (employees['Salary'] > 55)]

#### Group the employees by number of years of employment. What are the average salaries in each group?

In [None]:
employees.groupby('Years')['Salary'].agg('mean').reset_index()

#### What is the average salary per title?

In [None]:
employees.groupby('Title')['Salary'].agg('mean').reset_index()

#### Find the salary quartiles.

In [None]:
employees['Salary'].quantile([0.25, 0.50, 0.75, 1])
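As a quick check of what `quantile` returns, the sketch below uses a hypothetical five-value series. With pandas' default linear interpolation, the quartiles of these values fall exactly on data points, and the quantile at `1` is simply the maximum.

```python
import pandas as pd

# Hypothetical salaries, chosen so the quartiles land on actual values.
s = pd.Series([10, 20, 30, 40, 50])

# Passing a list returns a Series indexed by the requested quantiles.
quartiles = s.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The quantile at 1 coincides with the maximum of the series.
top = s.quantile(1)
print(top)
```

`Series.describe()` reports the same three quartiles alongside the count, mean, standard deviation, minimum and maximum.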

#### Is the mean salary different per gender?

In [None]:
employees.groupby('Gender')['Salary'].agg('mean').reset_index()

#### Find the minimum, mean and maximum of all numeric columns for each company department.

In [None]:
employees.groupby('Department').agg({'Years': ['min', 'mean', 'max'],
                                     'Salary': ['min', 'mean', 'max']}).reset_index()

#### Bonus Question:  for each department, compute the difference between the maximum and the minimum salary.
**Hint**: try using `agg` or `apply` combined with `lambda` functions.

In [None]:
employees.groupby('Department')['Salary'].agg(np.ptp).reset_index()
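The hint suggests `agg` with a `lambda`; a minimal sketch of that route on hypothetical toy data follows, together with an equivalent version that subtracts the grouped minimum from the grouped maximum. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the real Employee.csv values.
toy = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [30, 50, 40, 70, 55],
})

grouped = toy.groupby('Department')['Salary']

# Variant 1: a lambda that receives each department's salaries as a Series.
range_lambda = grouped.agg(lambda s: s.max() - s.min())

# Variant 2: subtract the per-group minimum from the per-group maximum.
range_direct = grouped.max() - grouped.min()

print(range_lambda)
```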

# [ONLY ONE MANDATORY] Challenge 3
#### Open the `Orders` dataset. Name your dataset `orders`.

In [None]:
orders = pd.read_csv('Orders.csv')

orders = orders.drop(columns=['Unnamed: 0'])

orders.head()

#### Explore your dataset by looking at the data types and summary statistics. Comment your results.

In [None]:
display(orders.dtypes)

orders.describe()

In [None]:
# ANSWER:

# int64: 'InvoiceNo', 'year', 'month', 'day', 'hour', 'Quantity', 'CustomerID'
# object: 'StockCode', 'Description', 'InvoiceDate', 'Country'
# float64: 'UnitPrice', 'amount_spent'
# 'year', 'month', 'day' and 'hour' could be combined into a single datetime column

# Orders are from 2010 and 2011
# Judging by the mean of 'month', there seem to be more orders in the second half of the year
# Judging by the mean relative to the standard deviation, 'Quantity', 'UnitPrice' and 'amount_spent' likely contain outliers

#### What is the average purchase price?

In [None]:
np.mean(orders['amount_spent'])

#### What are the highest and lowest purchase prices? 

In [None]:
print('Highest:', orders['amount_spent'].max())
print('Lowest:', orders['amount_spent'].min())

#### Select all the customers from Spain.
**Hint**: Remember that you are not asked to find orders from Spain but customers. A customer might have more than one order associated. 

In [None]:
customers_spain = orders[orders['Country'] == 'Spain']['CustomerID'].unique()

print(customers_spain)

#### How many customers do we have in Spain?

In [None]:
len(customers_spain)

#### Select all the customers who have bought more than 50 items.
**Hint**: Remember that you are not asked to find orders with more than 50 items but customers who bought more than 50 items. A customer with two orders of 30 items each should appear in the selection.

In [None]:
items_50 = orders.groupby('CustomerID')['Quantity'].agg('sum').reset_index()

items_50[items_50['Quantity'] > 50]['CustomerID']
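The difference the hint points at (totals per customer versus single order rows) can be seen on a toy frame. The data below is hypothetical: customer 1 has two orders of 30 items each, so they qualify only once quantities are summed per customer.

```python
import pandas as pd

# Hypothetical toy data; not the real Orders.csv values.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3],
    'Quantity': [30, 30, 10, 60],
})

# Sum quantities per customer first, then filter on the total.
totals = toy.groupby('CustomerID')['Quantity'].sum()
big_buyers = totals[totals > 50].index.tolist()

# Filtering single rows instead misses customer 1.
row_level = toy.loc[toy['Quantity'] > 50, 'CustomerID'].unique().tolist()

print(big_buyers, row_level)
```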

#### Select orders from Spain that include more than 50 items.

In [None]:
orders[(orders['Country'] == 'Spain') & (orders['Quantity'] > 50)]

#### Select all free orders.

In [None]:
orders[orders['amount_spent'] == 0]

#### Select all orders whose description starts with `lunch bag`.
**Hint**: use string functions.

In [None]:
orders_lunch_bag = orders[orders['Description'].str.startswith('lunch bag', na=False)]

orders_lunch_bag
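Two details of `str.startswith` are worth a quick sketch: missing descriptions and letter case. The series below is hypothetical. Whether `Orders.csv` actually contains missing or upper-case descriptions is an assumption that is not verified here.

```python
import pandas as pd

# Hypothetical descriptions, including a missing value and an upper-case entry.
desc = pd.Series(['lunch bag red', 'LUNCH BAG blue', None, 'bag lunch'])

# na=False treats missing descriptions as "no match", so the result
# is a clean boolean mask that can be used for filtering.
mask = desc.str.startswith('lunch bag', na=False)

# Lower-casing first makes the match case-insensitive.
mask_ci = desc.str.lower().str.startswith('lunch bag', na=False)

print(mask.tolist(), mask_ci.tolist())
```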

#### Select all `lunch bag` orders made in 2011.

In [None]:
orders_lunch_bag[orders_lunch_bag['year'] == 2011]

#### Show the frequency distribution of the amount spent in Spain.

In [None]:
orders[orders['Country'] == 'Spain']['amount_spent'].value_counts()

#### Select all orders made in the month of August.

In [None]:
orders[orders['month'] == 8]

#### Find the number of orders made by each country in the month of August.
**Hint**: Use value_counts().

In [None]:
orders[orders['month'] == 8].groupby('Country')['InvoiceNo'].agg('count')
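Counting rows per country and counting distinct invoices per country can give different numbers if one invoice spans several rows. The sketch below uses hypothetical toy data in which invoice 1 has two line items. Whether `Orders.csv` stores one row per line item is an assumption here.

```python
import pandas as pd

# Hypothetical toy data: invoice 1 (Spain) appears on two rows.
toy = pd.DataFrame({
    'Country': ['Spain', 'Spain', 'France'],
    'InvoiceNo': [1, 1, 2],
    'month': [8, 8, 8],
})

august = toy[toy['month'] == 8]

# value_counts counts rows per country, as the hint suggests.
rows_per_country = august['Country'].value_counts()

# nunique counts distinct invoice numbers per country instead.
invoices_per_country = august.groupby('Country')['InvoiceNo'].nunique()

print(rows_per_country)
print(invoices_per_country)
```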

#### What's the average amount of money spent by country?

In [None]:
orders.groupby('Country')['amount_spent'].agg('mean')

#### What's the most expensive item?

In [None]:
orders.loc[orders['UnitPrice'].idxmax(), ['Description', 'UnitPrice']]

#### What is the average amount spent per year?

In [None]:
orders.groupby('year')['amount_spent'].agg('mean')