# Subsetting and Descriptive Stats

## Before you start:
   - Remember that you just need to do one of the challenges.
   - Keep in mind that you need to use some of the functions you learned in the previous lessons.
   - All datasets are provided in IronHack's database.
   - Elaborate on your code and outputs as much as you can.
   - Try your best to answer the questions and complete the tasks and most importantly: enjoy the process!
   
#### Import all the necessary libraries here:

In [1]:
# import libraries here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# [ONLY ONE MANDATORY] Challenge 1
#### In this challenge we will use the `Temp_States`  dataset. 

#### First import it into a dataframe called `temp`.

In [6]:
temp=pd.read_csv('Temp_States.csv')

#### Print `temp`.

In [7]:
display(temp)

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
1,Albany,New York,9.444444
2,Buffalo,New York,3.333333
3,Hartford,Connecticut,17.222222
4,Bridgeport,Connecticut,14.444444
5,Treton,New Jersey,22.222222
6,Newark,New Jersey,20.0


#### Explore the data types of the `temp` dataframe. What types of data do we have? Comment on your result.

In [11]:
temp.dtypes

City            object
State           object
Temperature    float64
dtype: object

In [None]:
"""
City and State have dtype "object", so these columns contain text.
Temperature is float64, which means it contains numerical values with decimals.
"""

#### Select the rows where state is New York.

In [13]:
display(temp[temp['State']=='New York'])

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
1,Albany,New York,9.444444
2,Buffalo,New York,3.333333


#### What is the average temperature of cities in New York?

In [19]:
#Use of loc to access the rows
average_temperature_ny = temp.loc[temp['State'] == 'New York', 'Temperature'].mean()
print(average_temperature_ny)

10.74074074074074


#### Which states and cities have a temperature above 15 degrees Celsius?

In [32]:
# your code here
temperature_above_15=temp[temp['Temperature']>15][['City','State']]
temperature_above_15

Unnamed: 0,City,State
0,NYC,New York
3,Hartford,Connecticut
5,Treton,New Jersey
6,Newark,New Jersey


#### Now, return only the cities that have a temperature above 15 degrees Celsius.

In [34]:
# your code here
cities_temperature_above_15=temp[temp['Temperature']>15]['City']
cities_temperature_above_15

0         NYC
3    Hartford
5      Treton
6      Newark
Name: City, dtype: object

#### Which cities have a temperature above 15 degrees Celsius and below 20 degrees Celsius?

**Hint**: First, write the condition. Then, select the rows.

In [35]:
# your code here
temp[(temp['Temperature']>15) & (temp['Temperature']<20)]['City']

0         NYC
3    Hartford
Name: City, dtype: object
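
The hint above (write the condition first, then select the rows) can be sketched as follows. This is a minimal illustration that rebuilds the table from the values printed earlier; the names `temp_demo`, `is_mild` and `mild_cities` are made up for this example.

```python
import pandas as pd

# Rows copied from the `temp` table printed earlier in this notebook.
temp_demo = pd.DataFrame({
    "City": ["NYC", "Albany", "Buffalo", "Hartford", "Bridgeport", "Treton", "Newark"],
    "State": ["New York", "New York", "New York", "Connecticut",
              "Connecticut", "New Jersey", "New Jersey"],
    "Temperature": [19.444444, 9.444444, 3.333333, 17.222222,
                    14.444444, 22.222222, 20.0],
})

# Step 1: write the condition. Each comparison needs its own parentheses
# because `&` binds more tightly than `>` and `<`.
is_mild = (temp_demo["Temperature"] > 15) & (temp_demo["Temperature"] < 20)

# Step 2: select the rows (and the column of interest) with the mask.
mild_cities = temp_demo.loc[is_mild, "City"]
print(mild_cities.tolist())
```

Storing the mask in a variable makes it easy to inspect (`is_mild.sum()` counts the matching rows) and to reuse in later selections.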

#### Find the mean and standard deviation of the temperature of each state.

In [39]:
# your code here
temp.groupby('State').agg({'Temperature':['mean','std']})

State,Temperature mean,Temperature std
Connecticut,15.833333,1.964186
New Jersey,21.111111,1.571348
New York,10.740741,8.133404


In [42]:
#Another way 
result = temp.groupby('State')['Temperature'].agg(['mean', 'std'])
result

State,mean,std
Connecticut,15.833333,1.964186
New Jersey,21.111111,1.571348
New York,10.740741,8.133404
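
A third option, sketched here as an assumption about what may read best rather than as the expected answer, is pandas named aggregation, which lets you choose the output column names directly. The names `temp_demo`, `mean_temp` and `std_temp` are invented for this example.

```python
import pandas as pd

# Rows copied from the `temp` table printed earlier in this notebook.
temp_demo = pd.DataFrame({
    "City": ["NYC", "Albany", "Buffalo", "Hartford", "Bridgeport", "Treton", "Newark"],
    "State": ["New York", "New York", "New York", "Connecticut",
              "Connecticut", "New Jersey", "New Jersey"],
    "Temperature": [19.444444, 9.444444, 3.333333, 17.222222,
                    14.444444, 22.222222, 20.0],
})

# Named aggregation: new_column=(source_column, aggregation).
# The result has flat, self-describing column names instead of a MultiIndex.
state_stats = temp_demo.groupby("State").agg(
    mean_temp=("Temperature", "mean"),
    std_temp=("Temperature", "std"),
)
print(state_stats)
```

Note that `std` here is the sample standard deviation (pandas divides by n - 1 by default), which is why it matches the values shown above.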


# [ONLY ONE MANDATORY]  Challenge 2

#### Load the `employees` dataset into a dataframe. Call the dataframe `employees`.

In [46]:
# your code here
employees=pd.read_csv('Employee.csv')

employees.head()

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30
3,Sonia,HR,Bachelor,F,analyst,4,35
4,Samuel,Sales,Master,M,associate,3,55


#### Explore the data types of the `employees` dataframe. Comment on your results.

In [47]:
employees.dtypes

Name          object
Department    object
Education     object
Gender        object
Title         object
Years          int64
Salary         int64
dtype: object

In [50]:
"""
Overall, the data types are appropriate for their columns: text data is stored as 'object'
and numerical data (Years and Salary) is stored as int64.
"""

#### What's the average salary in this company?

In [51]:
# your code here
employees['Salary'].mean()

48.888888888888886

#### What's the highest salary?

In [52]:
# your code here
employees['Salary'].max()

70

#### What's the lowest salary?

In [53]:
# your code here
employees['Salary'].min()

30

#### Who are the employees with the lowest salary?

In [57]:
# your code here
min_salary = employees['Salary'].min()
lowest_salary_employee = employees[employees['Salary'] == min_salary]


lowest_salary_employee['Name']

1    Maria
2    David
Name: Name, dtype: object

#### Find all the information about an employee called David.

In [59]:
# your code here
employees[employees['Name']=='David']

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
2,David,HR,Master,M,analyst,2,30


#### Could you return only David's salary?

In [60]:
# your code here
employees[employees['Name']=='David']['Salary']

2    30
Name: Salary, dtype: int64

#### Print all the rows where job title is associate.

In [67]:
# your code here
employees[employees['Title']=='associate']

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
7,Pedro,IT,Phd,M,associate,7,60


#### Print the first 3 rows of your dataframe.
**Tip**: There are 2 ways to do it. Do it both ways.

In [68]:
# Method 1
# your code here
display(employees.head(3))

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


In [70]:
# Method 2
# your code here
display(employees.loc[:2])

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30
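
The two methods agree here only because the index still runs 0, 1, 2, and so on. The sketch below, built from the first five rows shown by `head()` (a subset of columns, with `employees_demo` and `by_name` as made-up names), shows how label-based `loc` and position-based `iloc` diverge once the rows are reordered.

```python
import pandas as pd

# First five rows of `employees`, copied from the head() output above.
employees_demo = pd.DataFrame({
    "Name": ["Jose", "Maria", "David", "Sonia", "Samuel"],
    "Department": ["IT", "IT", "HR", "HR", "Sales"],
    "Salary": [35, 30, 30, 35, 55],
})

# Sorting by name reorders the rows but keeps the original index labels,
# so the index becomes 2, 0, 1, 4, 3.
by_name = employees_demo.sort_values("Name")

# iloc slices by position: always the first three rows.
first_three_by_position = by_name.iloc[:3]

# loc slices by label: every row from the start up to and including label 2.
up_to_label_two = by_name.loc[:2]

print(first_three_by_position["Name"].tolist())
print(up_to_label_two["Name"].tolist())
```

When "the first n rows" is what you mean, `head(n)` or `iloc[:n]` is the safer choice; `loc[:k]` is only equivalent on a default, unmodified index.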


#### Find the employees whose title is associate and whose salary is above 55.

In [72]:
# your code here
employees[(employees['Title']=='associate')&(employees['Salary']>55)]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
7,Pedro,IT,Phd,M,associate,7,60


#### Group the employees by number of years of employment. What are the average salaries in each group?

In [73]:
# your code here
# numeric_only=True skips the text columns, which cannot be averaged
employees.groupby('Years').mean(numeric_only=True)

Years,Salary
1,35.0
2,38.333333
3,55.0
4,35.0
7,60.0
8,70.0


####  What is the average salary per title?

In [78]:
# your code here
# Select the Salary column before aggregating so text columns are never averaged
employees.groupby('Title')['Salary'].mean()


Title
VP           70.000000
analyst      32.500000
associate    56.666667
Name: Salary, dtype: float64

####  Find the salary quartiles.


In [79]:
# Quartiles computed with NumPy's default (linear) interpolation
q1 = np.quantile(employees['Salary'], 0.25)
print("the first quartile is", q1)
q2 = np.quantile(employees['Salary'], 0.50)
print("the second quartile is", q2)
q3 = np.quantile(employees['Salary'], 0.75)
print("the third quartile is", q3)

the first quartile is 35.0
the second quartile is 55.0
the third quartile is 60.0
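
The same quartiles can also be obtained in a single call with `Series.quantile`. In the sketch below the salary values are illustrative (an assumption chosen to be consistent with the outputs in this notebook, not a copy of the file), and `salaries` and `quartiles` are made-up names.

```python
import pandas as pd

# Illustrative salaries (an assumption), not read from Employee.csv.
salaries = pd.Series([35, 30, 30, 35, 55, 55, 60, 70, 70], name="Salary")

# Passing a list of probabilities returns one value per quantile,
# indexed by the probability that was requested.
quartiles = salaries.quantile([0.25, 0.50, 0.75])
print(quartiles)
```

Like `np.quantile`, `Series.quantile` uses linear interpolation by default, so the two approaches give the same numbers.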


#### Is the mean salary different per gender?

In [81]:
# your code here
employees.groupby('Gender').agg({'Salary':'mean'})

Gender,Salary
F,47.5
M,50.0


In [None]:
# Yes: on average, men earn more than women (50.0 vs. 47.5).

#### Find the minimum, mean and maximum of all numeric columns for each company department.



In [84]:
# your code here
# Restrict the aggregation to the numeric columns; 'mean' is undefined for text
employees.groupby('Department')[['Years','Salary']].agg(['min','mean','max'])

Department,Years min,Years mean,Years max,Salary min,Salary mean,Salary max
HR,2,4.666667,8,30,45.0,70
IT,1,4.5,8,30,48.75,70
Sales,2,2.5,3,55,55.0,55


#### Bonus Question:  for each department, compute the difference between the maximum and the minimum salary.
**Hint**: try using `agg` or `apply` combined with `lambda` functions.

In [86]:
# your code here
dif_salary=employees.groupby('Department')['Salary'].agg(lambda x: x.max() - x.min())

print(dif_salary)


Department
HR       40
IT       40
Sales     0
Name: Salary, dtype: int64
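
The range can also be computed without a `lambda` by subtracting two grouped aggregates. The sketch below uses only the first five rows shown by `head()`, so its numbers differ from the full-dataset result above; `employees_demo`, `salary_range` and `same_with_lambda` are made-up names.

```python
import pandas as pd

# First five rows of `employees`, copied from the head() output above.
employees_demo = pd.DataFrame({
    "Name": ["Jose", "Maria", "David", "Sonia", "Samuel"],
    "Department": ["IT", "IT", "HR", "HR", "Sales"],
    "Salary": [35, 30, 30, 35, 55],
})

by_department = employees_demo.groupby("Department")["Salary"]

# Both aggregates share the same Department index, so they subtract row by row.
salary_range = by_department.max() - by_department.min()

# The lambda version from the cell above, for comparison.
same_with_lambda = by_department.agg(lambda s: s.max() - s.min())

print(salary_range)
```

The subtraction form uses pandas' built-in `max` and `min`, which tends to be faster on large data than calling a Python function once per group.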


# [ONLY ONE MANDATORY] Challenge 3
#### Open the `Orders` dataset. Name your dataset `orders`.

In [None]:
# your code here

#### Explore your dataset by looking at the data types and summary statistics. Comment on your results.

In [None]:
# your code here

In [None]:
"""
your comments here
"""

####  What is the average purchase price?

In [None]:
# your code here

#### What are the highest and lowest purchase prices? 

In [None]:
# your code here

#### Select all the customers from Spain.
**Hint**: Remember that you are not asked to find orders from Spain but customers. A customer might have more than one order associated. 

In [None]:
# your code here

#### How many customers do we have in Spain?

In [None]:
# your code here
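
The two prompts above rest on the difference between orders and customers. The sketch below illustrates it on a tiny made-up table; the column names `InvoiceNo`, `CustomerID` and `Country` are assumptions and may not match the real `Orders` file.

```python
import pandas as pd

# Hypothetical toy data: customer 101 placed two orders from Spain.
orders_demo = pd.DataFrame({
    "InvoiceNo": [1, 2, 3, 4, 5],
    "CustomerID": [101, 101, 102, 103, 104],
    "Country": ["Spain", "Spain", "Spain", "France", "Spain"],
})

# Filtering by country returns orders, one row per order.
spain_orders = orders_demo[orders_demo["Country"] == "Spain"]

# Deduplicating the customer column turns orders into customers.
spain_customers = spain_orders["CustomerID"].unique()
n_spain_customers = spain_orders["CustomerID"].nunique()

print(len(spain_orders), "orders from", n_spain_customers, "customers")
```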

#### Select all the customers who have bought more than 50 items.
**Hint**: Remember that you are not asked to find orders with more than 50 items but customers who bought more than 50 items. A customer with two orders of 30 items each should appear in the selection.

In [None]:
# your code here
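
The hint above implies summing items per customer before filtering. A minimal sketch on made-up data follows; the column names `CustomerID` and `Quantity` are assumptions about the `Orders` file.

```python
import pandas as pd

# Hypothetical toy data: customer 101 has two orders of 30 items each.
orders_demo = pd.DataFrame({
    "CustomerID": [101, 101, 102, 103],
    "Quantity": [30, 30, 60, 10],
})

# Total items per customer, across all of their orders.
items_per_customer = orders_demo.groupby("CustomerID")["Quantity"].sum()

# Keep the customers whose total exceeds 50.
big_buyers = items_per_customer[items_per_customer > 50]
print(big_buyers)

# For contrast: filtering order by order misses customer 101.
big_orders = orders_demo[orders_demo["Quantity"] > 50]
```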

#### Select orders from Spain that include more than 50 items.

In [None]:
# your code here

#### Select all free orders.

In [None]:
# your code here

#### Select all orders whose description starts with `lunch bag`.
**Hint**: use string functions.

In [None]:
# your code here
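
One string function that fits the hint is `Series.str.startswith`. The sketch below uses made-up descriptions and assumes the column is called `Description`; whether the real data needs case normalisation is also an assumption.

```python
import pandas as pd

# Hypothetical toy data, including mixed case and a missing description.
orders_demo = pd.DataFrame({
    "Description": ["lunch bag red retrospot", "Lunch Bag Cars Blue",
                    "jumbo bag pink", None],
})

# Lower-case first so the match ignores case; na=False treats missing
# descriptions as non-matches instead of propagating NaN into the mask.
starts_with_lunch_bag = (
    orders_demo["Description"].str.lower().str.startswith("lunch bag", na=False)
)

lunch_bag_orders = orders_demo[starts_with_lunch_bag]
print(lunch_bag_orders)
```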

#### Select all `lunch bag` orders made in 2011.

In [None]:
# your code here

#### Show the frequency distribution of the amount spent in Spain.

In [None]:
# your code here

#### Select all orders made in the month of August.

In [None]:
# your code here

#### Find the number of orders made by each country in the month of August.
**Hint**: Use value_counts().

In [None]:
# your code here
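
A sketch of the `value_counts()` hint on made-up data follows. The column names `InvoiceDate` and `Country` are assumptions, and the dates are parsed explicitly because a CSV column of dates is usually loaded as text.

```python
import pandas as pd

# Hypothetical toy data with dates stored as strings.
orders_demo = pd.DataFrame({
    "InvoiceDate": ["2011-08-03", "2011-08-17", "2011-09-01", "2010-08-21"],
    "Country": ["Spain", "France", "Spain", "Spain"],
})

# Convert to datetime so the .dt accessor is available.
orders_demo["InvoiceDate"] = pd.to_datetime(orders_demo["InvoiceDate"])

# Month 8 matches August of every year in the data.
august_orders = orders_demo[orders_demo["InvoiceDate"].dt.month == 8]

# Count the August rows per country, most frequent first.
orders_per_country = august_orders["Country"].value_counts()
print(orders_per_country)
```

If the dataset already has separate year and month columns, the same filter can be written against those columns instead.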

#### What's the  average amount of money spent by country?

In [None]:
# your code here

#### What's the most expensive item?

In [None]:
# your code here

#### What is the average amount spent per year?

In [None]:
# your code here