# Subsetting and Descriptive Stats

## Before you start:
   - Remember that you only need to complete one of the challenges.
   - Keep in mind that you need to use some of the functions you learned in the previous lessons.
   - All datasets are provided in IronHack's database.
   - Explain your code and outputs in as much detail as you can.
   - Try your best to answer the questions and complete the tasks and, most importantly, enjoy the process!
   
#### Import all the necessary libraries here:

In [1]:
import pandas as pd
import numpy as np

# [ONLY ONE MANDATORY] Challenge 1
#### In this challenge we will use the `Temp_States` dataset.

#### First import it into a dataframe called `temp`.

In [2]:
temp = pd.read_csv('Temp_States.csv')

#### Print `temp`.

In [3]:
temp

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
1,Albany,New York,9.444444
2,Buffalo,New York,3.333333
3,Hartford,Connecticut,17.222222
4,Bridgeport,Connecticut,14.444444
5,Treton,New Jersey,22.222222
6,Newark,New Jersey,20.0


#### Explore the data types of the `temp` dataframe. What types of data do we have? Comment on your result.

In [4]:
temp.dtypes

City            object
State           object
Temperature    float64
dtype: object

In [5]:
# The City and State columns hold strings (dtype object) with the location names; Temperature holds floating-point numbers (dtype float64).

#### Select the rows where state is New York.

In [6]:
newyork = temp[temp['State'] == 'New York']
newyork

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
1,Albany,New York,9.444444
2,Buffalo,New York,3.333333


#### What is the average temperature of cities in New York?

In [7]:
newyork['Temperature'].mean()

10.74074074074074
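
The row selection and the mean can also be chained in a single `.loc` call. The sketch below rebuilds the table printed earlier by hand so that it runs without `Temp_States.csv`; the variable name `ny_mean` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuild the table printed earlier so the example runs without the CSV file.
temp = pd.DataFrame({
    "City": ["NYC", "Albany", "Buffalo", "Hartford", "Bridgeport", "Treton", "Newark"],
    "State": ["New York", "New York", "New York", "Connecticut",
              "Connecticut", "New Jersey", "New Jersey"],
    "Temperature": [19.444444, 9.444444, 3.333333, 17.222222,
                    14.444444, 22.222222, 20.0],
})

# Boolean mask selects the rows, the column label selects the values,
# and .mean() reduces them to a single number.
ny_mean = temp.loc[temp["State"] == "New York", "Temperature"].mean()
print(round(ny_mean, 2))
```

Using `.loc` with a mask and a column label avoids creating an intermediate dataframe when only one column is needed.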

#### Which states and cities have a temperature above 15 degrees Celsius?

In [8]:
fifteen = temp[temp['Temperature'] > 15]

fifteen

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
3,Hartford,Connecticut,17.222222
5,Treton,New Jersey,22.222222
6,Newark,New Jersey,20.0


#### Now, return only the cities that have a temperature above 15 degrees Celsius.

In [9]:
list(fifteen['City'])

['NYC', 'Hartford', 'Treton', 'Newark']

#### Which cities have a temperature above 15 degrees Celsius and below 20 degrees Celsius?

**Hint**: First, write the condition. Then, select the rows.

In [10]:
condition = (temp["Temperature"] > 15) & (temp["Temperature"] < 20)

temp[condition]

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
3,Hartford,Connecticut,17.222222
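
As an alternative to combining two comparisons with `&`, `Series.between` can express the same open interval. This is a sketch that assumes pandas 1.3 or later (where `inclusive` accepts the string `"neither"`); the frame is rebuilt from the printed table and `in_range` is an illustrative name.

```python
import pandas as pd

# Rebuild the table printed earlier so the example runs without the CSV file.
temp = pd.DataFrame({
    "City": ["NYC", "Albany", "Buffalo", "Hartford", "Bridgeport", "Treton", "Newark"],
    "State": ["New York", "New York", "New York", "Connecticut",
              "Connecticut", "New Jersey", "New Jersey"],
    "Temperature": [19.444444, 9.444444, 3.333333, 17.222222,
                    14.444444, 22.222222, 20.0],
})

# inclusive="neither" excludes both endpoints, matching "> 15" and "< 20".
in_range = temp[temp["Temperature"].between(15, 20, inclusive="neither")]
print(list(in_range["City"]))
```

Newark, at exactly 20.0 degrees, is excluded because the upper bound is not inclusive.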


#### Find the mean and standard deviation of the temperature of each state.

In [11]:
# First, calculate the mean temperature per state

mean = temp.groupby(["State"])['Temperature'].mean()
print(mean)

# Then calculate the standard deviation per state

std = temp.groupby(["State"])['Temperature'].std()
print(std)

State
Connecticut    15.833333
New Jersey     21.111111
New York       10.740741
Name: Temperature, dtype: float64
State
Connecticut    1.964186
New Jersey     1.571348
New York       8.133404
Name: Temperature, dtype: float64
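
Both statistics can also be computed in one pass by handing a list of function names to `agg`. A minimal sketch, with the frame rebuilt from the printed table and `stats` as an illustrative name:

```python
import pandas as pd

# Rebuild the table printed earlier so the example runs without the CSV file.
temp = pd.DataFrame({
    "City": ["NYC", "Albany", "Buffalo", "Hartford", "Bridgeport", "Treton", "Newark"],
    "State": ["New York", "New York", "New York", "Connecticut",
              "Connecticut", "New Jersey", "New Jersey"],
    "Temperature": [19.444444, 9.444444, 3.333333, 17.222222,
                    14.444444, 22.222222, 20.0],
})

# One groupby, two aggregations: the result has one row per state
# and one column per statistic ("mean" and "std").
stats = temp.groupby("State")["Temperature"].agg(["mean", "std"])
print(stats)
```

Note that `std` here is the sample standard deviation (pandas defaults to `ddof=1`), which is why states with only two cities still get a value.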


# [ONLY ONE MANDATORY] Challenge 2

#### Load the `employees` dataset into a dataframe. Call the dataframe `employees`.

In [12]:
employees = pd.read_csv('Employee.csv')
employees

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30
3,Sonia,HR,Bachelor,F,analyst,4,35
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
6,Carlos,IT,Master,M,VP,8,70
7,Pedro,IT,Phd,M,associate,7,60
8,Ana,HR,Master,F,VP,8,70


#### Explore the data types of the `employees` dataframe. Comment on your results.

In [13]:
employees.dtypes

Name          object
Department    object
Education     object
Gender        object
Title         object
Years          int64
Salary         int64
dtype: object

In [14]:
# The Name, Department, Education, Gender and Title columns hold strings (dtype object), whereas Years and Salary hold integers (dtype int64).

#### What's the average salary in this company?

In [15]:
employees['Salary'].mean()

48.888888888888886

#### What's the highest salary?

In [16]:
employees['Salary'].max()

70

#### What's the lowest salary?

In [17]:
employees['Salary'].min()

30

#### Who are the employees with the lowest salary?

In [18]:
employees[employees['Salary'] == employees['Salary'].min()]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


#### Find all the information about an employee called David.

In [19]:
david = employees[employees['Name'] == 'David']
david

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
2,David,HR,Master,M,analyst,2,30


#### Could you return only David's salary?

In [20]:
david['Salary']

2    30
Name: Salary, dtype: int64
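
The result above is still a one-element Series (index label `2`, value `30`). If only the bare number is wanted, one option is to select the cell with `.loc` and take the first element. The sketch below rebuilds the employee table from the printed output; `david_salary` is an illustrative name.

```python
import pandas as pd

# Rebuild the table printed earlier so the example runs without Employee.csv.
employees = pd.DataFrame(
    [
        ("Jose", "IT", "Bachelor", "M", "analyst", 1, 35),
        ("Maria", "IT", "Master", "F", "analyst", 2, 30),
        ("David", "HR", "Master", "M", "analyst", 2, 30),
        ("Sonia", "HR", "Bachelor", "F", "analyst", 4, 35),
        ("Samuel", "Sales", "Master", "M", "associate", 3, 55),
        ("Eva", "Sales", "Bachelor", "F", "associate", 2, 55),
        ("Carlos", "IT", "Master", "M", "VP", 8, 70),
        ("Pedro", "IT", "Phd", "M", "associate", 7, 60),
        ("Ana", "HR", "Master", "F", "VP", 8, 70),
    ],
    columns=["Name", "Department", "Education", "Gender", "Title", "Years", "Salary"],
)

# Select David's row and the Salary column, then take the first (only) value.
david_salary = employees.loc[employees["Name"] == "David", "Salary"].iloc[0]
print(david_salary)
```

This assumes exactly one employee is named David; with several matches, `.iloc[0]` would silently return only the first.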

#### Print all the rows where job title is associate.

In [21]:
associate = employees[employees['Title'] == 'associate']
associate

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
7,Pedro,IT,Phd,M,associate,7,60


#### Print the first 3 rows of your dataframe.
**Tip**: There are 2 ways to do it. Do it both ways.

In [22]:
employees[:3]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


In [23]:
employees.head(3)

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


#### Find the employees whose title is associate and whose salary is above 55.

In [24]:
employees[(employees['Title'] == 'associate') & (employees['Salary'] > 55)]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
7,Pedro,IT,Phd,M,associate,7,60


#### Group the employees by number of years of employment. What are the average salaries in each group?

In [25]:
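# Select Salary explicitly: in pandas 2.x, calling .mean() on a group that
# still contains string columns raises a TypeError.
employees.groupby('Years')[['Salary']].mean()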

Unnamed: 0_level_0,Salary
Years,Unnamed: 1_level_1
1,35.0
2,38.333333
3,55.0
4,35.0
7,60.0
8,70.0


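#### What is the average salary per title?

In [26]:
# numeric_only=True skips the string columns, which would otherwise raise a TypeError in pandas 2.x.
employees.groupby('Title').mean(numeric_only=True)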

Unnamed: 0_level_0,Years,Salary
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
VP,8.0,70.0
analyst,2.25,32.5
associate,4.0,56.666667


#### Find the salary quartiles.


In [27]:
quarter = employees['Salary'].quantile(q=0.25)
half = employees['Salary'].median()
threequarters = employees['Salary'].quantile(q=0.75)
highest = employees['Salary'].max()

print(f"Salary quartiles: the first quartile is {quarter}, the median is {half}, the third quartile is {threequarters}, and the highest salary is {highest}.")

Salary quartiles: the first quartile is 35.0, the median is 55.0, the third quartile is 60.0, and the highest salary is 70.
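
The three quartiles can also be requested in one call by passing a list of probabilities to `quantile`. A minimal sketch using only the salary values printed in the employee table; `salaries` and `quartiles` are illustrative names.

```python
import pandas as pd

# Salary column copied from the employee table printed earlier.
salaries = pd.Series([35, 30, 30, 35, 55, 55, 70, 60, 70], name="Salary")

# A list of probabilities returns a Series indexed by those probabilities.
quartiles = salaries.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

`salaries.describe()` reports the same three values under the labels `25%`, `50%` and `75%`, along with the count, mean, standard deviation, minimum and maximum.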


#### Is the mean salary different per gender?

In [28]:
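# numeric_only=True skips the string columns, which would otherwise raise a TypeError in pandas 2.x.
employees.groupby('Gender').mean(numeric_only=True)

# Yes: the mean salary is 47.5 for women and 50.0 for men. With only nine employees,
# the gap may reflect other factors such as experience and role rather than gender itself.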

Unnamed: 0_level_0,Years,Salary
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,4.0,47.5
M,4.2,50.0


#### Find the minimum, mean and maximum of all numeric columns for each company department.



In [29]:
department = employees.groupby('Department')['Salary'].describe()
department

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Department,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
HR,3.0,45.0,21.794495,30.0,32.5,35.0,52.5,70.0
IT,4.0,48.75,19.31105,30.0,33.75,47.5,62.5,70.0
Sales,2.0,55.0,0.0,55.0,55.0,55.0,55.0,55.0
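
The `describe()` call above covers only `Salary`. To get just the minimum, mean and maximum for every numeric column, one option is `agg` with a list of function names. The sketch below rebuilds only the columns it needs from the printed employee table; `summary` is an illustrative name.

```python
import pandas as pd

# Department, Years and Salary copied from the employee table printed earlier.
employees = pd.DataFrame({
    "Name": ["Jose", "Maria", "David", "Sonia", "Samuel", "Eva", "Carlos", "Pedro", "Ana"],
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Years": [1, 2, 2, 4, 3, 2, 8, 7, 8],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# Select the numeric columns explicitly, then apply three aggregations to each.
# The result has two column levels: (column name, statistic).
summary = employees.groupby("Department")[["Years", "Salary"]].agg(["min", "mean", "max"])
print(summary)
```

Individual cells can be read with a tuple, for example `summary.loc["IT", ("Salary", "mean")]`.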


#### Bonus Question:  for each department, compute the difference between the maximum and the minimum salary.
**Hint**: try using `agg` or `apply` combined with `lambda` functions.

In [30]:
department['difference'] = department['max']-department['min']
department

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max,difference
Department,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
HR,3.0,45.0,21.794495,30.0,32.5,35.0,52.5,70.0,40.0
IT,4.0,48.75,19.31105,30.0,33.75,47.5,62.5,70.0,40.0
Sales,2.0,55.0,0.0,55.0,55.0,55.0,55.0,55.0,0.0
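
The hint suggests `agg` with a `lambda`; a sketch of that route is shown here as an alternative to subtracting the `describe()` columns. The frame is rebuilt from the printed employee table with only the columns needed, and `salary_range` is an illustrative name.

```python
import pandas as pd

# Department and Salary copied from the employee table printed earlier.
employees = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# The lambda receives each department's Salary values as a Series
# and returns the spread between the largest and the smallest.
salary_range = employees.groupby("Department")["Salary"].agg(lambda s: s.max() - s.min())
print(salary_range)
```

Because the lambda works on the raw integer column, the result stays an integer Series, whereas the `describe()` route produces floats.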


# [ONLY ONE MANDATORY] Challenge 3
#### Open the `Orders` dataset. Name your dataset `orders`.

#### Explore your dataset by looking at the data types and summary statistics. Comment on your results.

#### What is the average purchase price?

#### What are the highest and lowest purchase prices? 

#### Select all the customers from Spain.
**Hint**: Remember that you are not asked to find orders from Spain but customers. A customer might have more than one order associated. 

#### How many customers do we have in Spain?
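
To illustrate the difference between counting orders and counting customers, here is a sketch on a tiny made-up frame. The column names `InvoiceNo`, `CustomerID`, `Country` and `Quantity` and all values are hypothetical stand-ins; the real `Orders` dataset may use different names.

```python
import pandas as pd

# Hypothetical toy data: customer 1 placed two orders, both from Spain.
orders = pd.DataFrame({
    "InvoiceNo": [1001, 1002, 1003, 1004],
    "CustomerID": [1, 1, 2, 3],
    "Country": ["Spain", "Spain", "Spain", "France"],
    "Quantity": [30, 30, 60, 10],
})

# Filter the orders by country, then keep each customer only once.
spain_customers = orders.loc[orders["Country"] == "Spain", "CustomerID"].drop_duplicates()

# Three Spanish orders, but only two distinct customers.
n_spain_customers = spain_customers.nunique()
print(n_spain_customers)
```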

#### Select all the customers who have bought more than 50 items.
**Hint**: Remember that you are not asked to find orders with more than 50 items but customers who bought more than 50 items. A customer with two orders of 30 items each should appear in the selection.
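
The hint can be sketched as a group-then-filter step: sum the items per customer first, then apply the threshold to the totals. The column names `CustomerID` and `Quantity` and the values below are hypothetical; adapt them to the actual `Orders` columns.

```python
import pandas as pd

# Hypothetical toy data: customer 1 has two orders of 30 items each.
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Quantity": [30, 30, 60, 10],
})

# Total items per customer across all of their orders.
total_items = orders.groupby("CustomerID")["Quantity"].sum()

# Keep customers whose total exceeds 50, then pull all of their orders.
big_buyers = total_items[total_items > 50]
big_buyer_orders = orders[orders["CustomerID"].isin(big_buyers.index)]
print(big_buyers)
```

Filtering `orders["Quantity"] > 50` directly would miss customer 1, whose individual orders are each below the threshold.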

#### Select orders from Spain that include more than 50 items.

#### Select all free orders.

In [31]:
# your code here

#### Select all orders whose description starts with `lunch bag`.
**Hint**: use string functions.

In [32]:
# your code here

#### Select all `lunch bag` orders made in 2011.

In [33]:
# your code here

#### Show the frequency distribution of the amount spent in Spain.

In [34]:
# your code here

#### Select all orders made in the month of August.

In [35]:
# your code here

#### Find the number of orders made by each country in the month of August.
**Hint**: Use `value_counts()`.

In [36]:
# your code here

#### What's the average amount of money spent by country?

In [37]:
# your code here

#### What's the most expensive item?

In [38]:
# your code here

#### What is the average amount spent per year?

In [39]:
# your code here