# Subsetting and Descriptive Stats

## Before you start:
   - Remember that you just need to do one of the challenges.
   - Keep in mind that you need to use some of the functions you learned in the previous lessons.
   - All datasets are provided in IronHack's database.
   - Explain your code and outputs as much as you can.
   - Try your best to answer the questions and complete the tasks and most importantly: enjoy the process!
   
#### Import all the necessary libraries here:

In [1]:
# import libraries here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# [ONLY ONE MANDATORY] Challenge 1
#### In this challenge we will use the `Temp_States`  dataset. 

#### First import it into a dataframe called `temp`.

In [3]:
# your code here
temp = pd.read_csv("./Temp_States.csv")

#### Print `temp`.

In [4]:
# your code here
print(temp)

         City        State  Temperature
0         NYC     New York    19.444444
1      Albany     New York     9.444444
2     Buffalo     New York     3.333333
3    Hartford  Connecticut    17.222222
4  Bridgeport  Connecticut    14.444444
5      Treton   New Jersey    22.222222
6      Newark   New Jersey    20.000000


#### Explore the data types of the *temp* dataframe. What types of data do we have? Comment on your result.

In [5]:
# your code here
display(temp.dtypes)

City            object
State           object
Temperature    float64
dtype: object

In [None]:
"""
City and State are shown as "object" because they are string columns. Temperature is float64 because it holds numbers with decimals.
"""

#### Select the rows where state is New York.

In [6]:
# your code here
new_york_temp = temp[temp['State'] == 'New York']
new_york_temp.head()

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
1,Albany,New York,9.444444
2,Buffalo,New York,3.333333


#### What is the average temperature of cities in New York?

In [7]:
# your code here
new_york_temp['Temperature'].mean()

10.74074074074074

#### Which states and cities have a temperature above 15 degrees Celsius?

In [15]:
# your code here
above15_temp = temp[temp['Temperature'] >15 ]
above15_temp.head(10)

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
3,Hartford,Connecticut,17.222222
5,Treton,New Jersey,22.222222
6,Newark,New Jersey,20.0


#### Now, return only the cities that have a temperature above 15 degrees Celsius.

In [16]:
# your code here
above15_temp_city = above15_temp.drop('State', axis = 1)
above15_temp_city.head(10) 

Unnamed: 0,City,Temperature
0,NYC,19.444444
3,Hartford,17.222222
5,Treton,22.222222
6,Newark,20.0


#### Which cities have a temperature above 15 degrees Celsius and below 20 degrees Celsius?

**Hint**: First, write the condition. Then, select the rows.

In [19]:
# your code here

temp['selected_city'] = (temp['Temperature'] > 15) & (temp['Temperature'] < 20)
selected_cities = temp[temp['selected_city']]  # the column is already boolean
selected_cities 

Unnamed: 0,City,State,Temperature,selected_city
0,NYC,New York,19.444444,True
3,Hartford,Connecticut,17.222222,True
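
The hint for this task (write the condition first, then select the rows) can also be followed without adding a helper column to `temp`. The sketch below rebuilds the table printed earlier so it runs on its own; the names `temp_demo` and `mask` are illustrative, not part of the exercise.

```python
import pandas as pd

# Rebuild the table printed earlier so the example is self-contained.
temp_demo = pd.DataFrame({
    "City": ["NYC", "Albany", "Buffalo", "Hartford", "Bridgeport", "Treton", "Newark"],
    "State": ["New York", "New York", "New York", "Connecticut",
              "Connecticut", "New Jersey", "New Jersey"],
    "Temperature": [19.444444, 9.444444, 3.333333, 17.222222,
                    14.444444, 22.222222, 20.000000],
})

# Step 1: write the condition as a boolean Series (one True/False per row).
mask = (temp_demo["Temperature"] > 15) & (temp_demo["Temperature"] < 20)

# Step 2: use the mask to select rows, keeping only the City column.
cities_15_to_20 = temp_demo.loc[mask, "City"]
print(cities_15_to_20.tolist())
```

Both comparisons are strict, so Newark (exactly 20.0) is excluded. The parentheses around each comparison are required because `&` binds more tightly than `>` and `<`.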


#### Find the mean and standard deviation of the temperature of each state.

In [23]:
# your code here

temp.groupby('State').agg({'Temperature': ['mean','std']})

State,Temperature mean,Temperature std
Connecticut,15.833333,1.964186
New Jersey,21.111111,1.571348
New York,10.740741,8.133404


# [ONLY ONE MANDATORY]  Challenge 2

#### Load the `employees` dataset into a dataframe. Call the dataframe `employees`.

In [31]:
# your code here
employees = pd.read_csv("./Employee.csv")
employees.head(20)

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30
3,Sonia,HR,Bachelor,F,analyst,4,35
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
6,Carlos,IT,Master,M,VP,8,70
7,Pedro,IT,Phd,M,associate,7,60
8,Ana,HR,Master,F,VP,8,70


#### Explore the data types of the `employees` dataframe. Comment on your results.

In [26]:
# your code here
display(employees.dtypes)

Name          object
Department    object
Education     object
Gender        object
Title         object
Years          int64
Salary         int64
dtype: object

In [None]:
"""
Years and Salary are whole numbers, so pandas stores them as int64. The remaining columns are "object" because they hold strings.
"""

#### What's the average salary in this company?

In [27]:
# your code here
employees['Salary'].mean()

48.888888888888886

#### What's the highest salary?

In [28]:
# your code here

employees['Salary'].max()

70

#### What's the lowest salary?

In [29]:
# your code here

employees['Salary'].min()

30

#### Who are the employees with the lowest salary?

In [34]:
# your code here
# Keep every row whose salary equals the minimum, so all ties are returned
lowest_paid = employees[employees['Salary'] == employees['Salary'].min()]
lowest_paid

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


#### Find all the information about an employee called David.

In [36]:
# your code here
employee_david = employees[employees['Name'] == 'David']
employee_david

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
2,David,HR,Master,M,analyst,2,30


#### Could you return only David's salary?

In [38]:
# your code here
david_salary = employee_david.loc[employee_david.index[0], 'Salary']
david_salary

30

#### Print all the rows where job title is associate.

In [39]:
# your code here

employees[employees['Title'] == 'associate']

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
7,Pedro,IT,Phd,M,associate,7,60


#### Print the first 3 rows of your dataframe.
**Tip**: There are 2 ways to do it. Do it both ways.

In [40]:
# Method 1
# your code here
employees.head(3)

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


In [42]:
# Method 2
# your code here

print(employees[:3])

    Name Department Education Gender    Title  Years  Salary
0   Jose         IT  Bachelor      M  analyst      1      35
1  Maria         IT    Master      F  analyst      2      30
2  David         HR    Master      M  analyst      2      30
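
A third option, beyond the two methods above, is positional indexing with `iloc`. The sketch below uses a cut-down copy of the table (`employees_demo` is an illustrative name) to check that all three approaches return the same rows.

```python
import pandas as pd

# Cut-down copy of the employees table, for illustration only.
employees_demo = pd.DataFrame({
    "Name": ["Jose", "Maria", "David", "Sonia"],
    "Salary": [35, 30, 30, 35],
})

first_head = employees_demo.head(3)    # method 1: head()
first_slice = employees_demo[:3]       # method 2: row slice
first_iloc = employees_demo.iloc[:3]   # alternative: positional indexing

# All three select the same rows.
print(first_head.equals(first_slice) and first_head.equals(first_iloc))
```

`iloc` always selects by position, which makes the intent explicit when the index is not the default 0, 1, 2, ... range.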


#### Find the employees whose title is associate and whose salary is above 55.

In [45]:
# your code here

employees['selected_employees'] = (employees['Title'] == 'associate') & (employees['Salary'] > 55)
selected_employees = employees[employees['selected_employees']]  # the column is already boolean
selected_employees

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary,selected_employees
7,Pedro,IT,Phd,M,associate,7,60,True


#### Group the employees by number of years of employment. What are the average salaries in each group?

In [46]:
# your code here
employees.groupby(['Years']).agg({'Salary':'mean'})

Years,Salary
1,35.0
2,38.333333
3,55.0
4,35.0
7,60.0
8,70.0


####  What is the average salary per title?

In [47]:
# your code here

employees.groupby(['Title']).agg({'Salary':'mean'})

Title,Salary
VP,70.0
analyst,32.5
associate,56.666667



####  Find the salary quartiles.


In [57]:
employees_for_quartiles = employees[['Salary']]  # keep only the Salary column

In [58]:
# your code here

# Lower quartile: the value below which 25% of the salaries fall
q1 = np.quantile(employees_for_quartiles, 0.25)
print("the first quartile is", q1)
q2 = np.quantile(employees_for_quartiles , 0.50)
print("the second quartile is",q2)
q3 = np.quantile(employees_for_quartiles , 0.75)
print("the third quartile is", q3)

the first quartile is 35.0
the second quartile is 55.0
the third quartile is 60.0
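
The same quartiles can be obtained in one call with `Series.quantile`, without first building a separate dataframe. The sketch below copies the salary values from the table shown earlier so that it runs on its own.

```python
import pandas as pd

# Salary values copied from the employees table shown earlier.
salary = pd.Series([35, 30, 30, 35, 55, 55, 70, 60, 70], name="Salary")

# Passing a list of quantiles returns a Series indexed by those quantiles.
quartiles = salary.quantile([0.25, 0.50, 0.75])
print(quartiles)
```

With nine values the three quartiles fall exactly on the 3rd, 5th and 7th sorted values, so the default linear interpolation gives the same numbers as `np.quantile` above.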


#### Is the mean salary different per gender?

In [59]:
# your code here
employees.groupby(['Gender']).agg({'Salary':'mean'})

Gender,Salary
F,47.5
M,50.0


#### Find the minimum, mean and maximum of all numeric columns for each company department.



In [63]:
# your code here

employees.groupby(['Department']).agg({'Salary':['mean','max','min']})

Department,Salary mean,Salary max,Salary min
HR,45.0,70,30
IT,48.75,70,30
Sales,55.0,55,55


In [64]:
employees.groupby(['Department']).agg({'Years':['mean','max','min']})

Department,Years mean,Years max,Years min
HR,4.666667,8,2
IT,4.5,8,1
Sales,2.5,3,2
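
The two cells above can be combined into a single `groupby` that covers every numeric column at once. This is one possible sketch, using a copy of the relevant columns (`employees_demo` and `numeric_cols` are illustrative names).

```python
import pandas as pd

# Relevant columns copied from the employees table shown earlier.
employees_demo = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Years": [1, 2, 2, 4, 3, 2, 8, 7, 8],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# Pick the numeric columns automatically instead of naming them by hand.
numeric_cols = employees_demo.select_dtypes(include="number").columns.tolist()

# One groupby, three statistics per numeric column.
summary = employees_demo.groupby("Department")[numeric_cols].agg(["min", "mean", "max"])
print(summary)
```

The result has two-level column labels such as `("Salary", "mean")`, which can be used with `summary.loc[...]` to read a single value.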


#### Bonus Question:  for each department, compute the difference between the maximum and the minimum salary.
**Hint**: try using `agg` or `apply` combined with `lambda` functions.

In [None]:
# your code here
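
One possible approach to the bonus question, following the hint, is to pass a `lambda` to `agg`. The sketch below uses a copy of the Department and Salary columns (`employees_demo` and `salary_range` are illustrative names).

```python
import pandas as pd

# Department and Salary columns copied from the employees table shown earlier.
employees_demo = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# For each department, the lambda receives that department's salaries as a Series.
salary_range = employees_demo.groupby("Department")["Salary"].agg(lambda s: s.max() - s.min())
print(salary_range.to_dict())
```

Sales has a range of 0 because both of its employees earn the same salary.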

# [ONLY ONE MANDATORY] Challenge 3
#### Open the `Orders` dataset. Name your dataset `orders`.

In [None]:
# your code here

#### Explore your dataset by looking at the data types and summary statistics. Comment on your results.

In [None]:
# your code here

In [None]:
"""
your comments here
"""

####  What is the average purchase price?

In [None]:
# your code here

#### What are the highest and lowest purchase prices? 

In [None]:
# your code here

#### Select all the customers from Spain.
**Hint**: Remember that you are not asked to find orders from Spain but customers. A customer might have more than one order associated. 

In [None]:
# your code here
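
The distinction the hint draws between orders and customers can be sketched on toy data. The column names `CustomerID`, `Country` and `Quantity` are assumptions made for illustration; the real `Orders` file may use different names.

```python
import pandas as pd

# Toy data; column names are assumptions, not the real Orders schema.
orders_demo = pd.DataFrame({
    "CustomerID": [101, 101, 102, 103, 103],
    "Country": ["Spain", "Spain", "France", "Spain", "Spain"],
    "Quantity": [30, 30, 80, 10, 5],
})

# Filtering by country returns orders: customer 101 appears twice.
spain_orders = orders_demo[orders_demo["Country"] == "Spain"]

# Dropping duplicate IDs turns the list of orders into a list of customers.
spain_customers = spain_orders["CustomerID"].drop_duplicates()
print(spain_customers.tolist())
print(spain_customers.nunique())
```

Four Spanish orders collapse to two distinct customers here, which is the difference the hint asks you to keep in mind.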

#### How many customers do we have in Spain?

In [None]:
# your code here

#### Select all the customers who have bought more than 50 items.
**Hint**: Remember that you are not asked to find orders with more than 50 items but customers who bought more than 50 items. A customer with two orders of 30 items each should appear in the selection.

In [None]:
# your code here
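
The hint's example (two orders of 30 items each) suggests summing quantities per customer before filtering. This sketch uses toy data with assumed column names `CustomerID` and `Quantity`.

```python
import pandas as pd

# Toy data; column names are assumptions, not the real Orders schema.
orders_demo = pd.DataFrame({
    "CustomerID": [101, 101, 102, 103, 103],
    "Quantity": [30, 30, 80, 10, 5],
})

# Total items per customer, across all of that customer's orders.
items_per_customer = orders_demo.groupby("CustomerID")["Quantity"].sum()

# Keep the customers whose total exceeds 50.
big_buyers = items_per_customer[items_per_customer > 50]
print(big_buyers.to_dict())
```

Customer 101 is selected even though neither individual order exceeds 50 items. Filtering the rows directly with `Quantity > 50` would miss that customer.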

#### Select orders from Spain that include more than 50 items.

In [None]:
# your code here

#### Select all free orders.

In [None]:
# your code here

#### Select all orders whose description starts with `lunch bag`.
**Hint**: use string functions.

In [None]:
# your code here
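
The string functions the hint refers to are exposed through the `.str` accessor. The sketch below uses toy data with an assumed `Description` column. Lower-casing before the comparison is also an assumption, made in case the real descriptions are stored in upper case.

```python
import pandas as pd

# Toy data; the Description column name and its contents are assumptions.
orders_demo = pd.DataFrame({
    "Description": ["lunch bag red retrospot", "LUNCH BAG CARS BLUE", "jumbo bag", None],
})

# Normalise case, then test the prefix; na=False treats missing descriptions as no match.
starts = orders_demo["Description"].str.lower().str.startswith("lunch bag", na=False)
lunch_bag_orders = orders_demo[starts]
print(lunch_bag_orders)
```

Without `na=False`, rows with a missing description produce a missing value in the mask, which cannot be used reliably for row selection.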

#### Select all `lunch bag` orders made in 2011.

In [None]:
# your code here

#### Show the frequency distribution of the amount spent in Spain.

In [None]:
# your code here

#### Select all orders made in the month of August.

In [None]:
# your code here

#### Find the number of orders made by each country in the month of August.
**Hint**: Use value_counts().

In [None]:
# your code here
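
The `value_counts()` hint can be sketched on toy data. The column names `InvoiceDate` and `Country` are assumptions, as is the idea that the date column is (or can be converted to) a datetime type.

```python
import pandas as pd

# Toy data; column names are assumptions, not the real Orders schema.
orders_demo = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-08-01", "2011-08-15", "2011-07-30", "2010-08-20"]),
    "Country": ["Spain", "Spain", "France", "France"],
})

# Select August rows from any year, then count rows per country.
august = orders_demo[orders_demo["InvoiceDate"].dt.month == 8]
orders_by_country = august["Country"].value_counts()
print(orders_by_country.to_dict())
```

`value_counts()` counts rows, so if the real dataset stores one row per order line rather than one per order, the result counts lines rather than orders.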

#### What's the  average amount of money spent by country?

In [None]:
# your code here

#### What's the most expensive item?

In [None]:
# your code here

#### What is the average amount spent per year?

In [None]:
# your code here