# Subsetting and Descriptive Stats

## Before you start:
   - Remember that you just need to do one of the challenges.
   - Keep in mind that you need to use some of the functions you learned in the previous lessons.
   - All datasets are provided in IronHack's database.
   - Explain your code and outputs as much as you can.
   - Try your best to answer the questions and complete the tasks and most importantly: enjoy the process!
   
#### Import all the necessary libraries here:

In [2]:
# import libraries here
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd


# [ONLY ONE MANDATORY] Challenge 1
#### In this challenge we will use the `Temp_States`  dataset. 

#### First import it into a dataframe called `temp`.

In [2]:
# your code here
temp = pd.read_csv("Temp_States.csv")

#### Print `temp`.

In [3]:
# your code here
print(temp)

         City        State  Temperature
0         NYC     New York    19.444444
1      Albany     New York     9.444444
2     Buffalo     New York     3.333333
3    Hartford  Connecticut    17.222222
4  Bridgeport  Connecticut    14.444444
5      Treton   New Jersey    22.222222
6      Newark   New Jersey    20.000000


#### Explore the data types of the *temp* dataframe. What types of data do we have? Comment your result.

In [15]:
# your code here
columns = temp.columns
[type(temp[i][0]) for i in columns]

[str, str, numpy.float64]
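
As an alternative to inspecting the first value of each column, pandas exposes the column types directly through the `dtypes` attribute. The sketch below rebuilds a few rows of the printed table by hand so that it runs without `Temp_States.csv`; the name `temp_demo` is made up for this illustration.

```python
import pandas as pd

# Rows copied by hand from the printed `temp` output (New York cities only).
temp_demo = pd.DataFrame({
    "City": ["NYC", "Albany", "Buffalo"],
    "State": ["New York", "New York", "New York"],
    "Temperature": [19.444444, 9.444444, 3.333333],
})

# One dtype per column. Text columns usually show up as `object`
# (or a dedicated string dtype, depending on the pandas version).
column_types = temp_demo.dtypes
print(column_types)
```

This reports the type of the whole column rather than of a single cell, which matters when a column mixes values.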

In [None]:
"""
City and State are strings (text); Temperature is a float (numpy.float64), i.e. a number with decimals.
"""

#### Select the rows where state is New York.

In [32]:
# your code here

new_york = temp[temp["State"]=="New York"]
new_york

      City     State  Temperature
0      NYC  New York    19.444444
1   Albany  New York     9.444444
2  Buffalo  New York     3.333333


#### What is the average temperature of cities in New York?

In [35]:
# your code here
avg_temp = new_york["Temperature"].sum()/new_york["Temperature"].count()
avg_temp

10.74074074074074

#### Which states and cities have a temperature above 15 degrees Celsius?

In [37]:
# your code here
temp[temp["Temperature"] > 15.0]

       City        State  Temperature
0       NYC     New York    19.444444
3  Hartford  Connecticut    17.222222
5    Treton   New Jersey    22.222222
6    Newark   New Jersey    20.000000


#### Now, return only the cities that have a temperature above 15 degrees Celsius.

In [38]:
# your code here
temp.loc[temp["Temperature"] > 15.0, "City"]

0         NYC
3    Hartford
5      Treton
6      Newark
Name: City, dtype: object

#### Which cities have a temperature above 15 degrees Celsius and below 20 degrees Celsius?

**Hint**: First, write the condition. Then, select the rows.

In [44]:
# your code here
# Note: >= and <= make both bounds inclusive, so a city at exactly 20.0 (Newark) is selected.
temp["City"][(temp["Temperature"]>=15.0) & (temp["Temperature"]<=20.0)]

0         NYC
3    Hartford
6      Newark
Name: City, dtype: object
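
The condition above uses `>=` and `<=`, so both bounds are inclusive, whereas the question reads as strictly above 15 and strictly below 20. A minimal sketch of the difference, using values copied from the printed table (`temp_demo` is an illustrative name, and Newark is assumed to be exactly 20.0, as printed):

```python
import pandas as pd

# Subset of the printed `temp` table, rebuilt by hand for illustration.
temp_demo = pd.DataFrame({
    "City": ["NYC", "Hartford", "Bridgeport", "Treton", "Newark"],
    "Temperature": [19.444444, 17.222222, 14.444444, 22.222222, 20.0],
})

t = temp_demo["Temperature"]

# Inclusive bounds: a value of exactly 15.0 or 20.0 is kept.
inclusive = temp_demo.loc[(t >= 15.0) & (t <= 20.0), "City"].tolist()

# Strict bounds: a value of exactly 15.0 or 20.0 is dropped.
strict = temp_demo.loc[(t > 15.0) & (t < 20.0), "City"].tolist()

print(inclusive)
print(strict)
```

Which version is right depends on how "above" and "below" are meant to be read; the strict form follows the wording of the question literally.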

In [47]:
temp

         City        State  Temperature
0         NYC     New York    19.444444
1      Albany     New York     9.444444
2     Buffalo     New York     3.333333
3    Hartford  Connecticut    17.222222
4  Bridgeport  Connecticut    14.444444
5      Treton   New Jersey    22.222222
6      Newark   New Jersey    20.000000


#### Find the mean and standard deviation of the temperature of each state.

In [54]:
# your code here
temp_state_mean = temp.groupby("State", as_index = False).agg({"Temperature": "mean"})
print(temp_state_mean)

temp_state_std = temp.groupby("State", as_index = False).agg({"Temperature": "std"})
print(temp_state_std)

         State  Temperature
0  Connecticut    15.833333
1   New Jersey    21.111111
2     New York    10.740741
         State  Temperature
0  Connecticut     1.964186
1   New Jersey     1.571348
2     New York     8.133404
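
The two separate `groupby` calls can also be combined: passing a list of function names to `agg` returns one column per statistic. A sketch under the assumption that the data matches the printed table (`temp_demo` and `state_stats` are illustrative names):

```python
import pandas as pd

# State and Temperature columns copied from the printed `temp` table.
temp_demo = pd.DataFrame({
    "State": ["New York"] * 3 + ["Connecticut"] * 2 + ["New Jersey"] * 2,
    "Temperature": [19.444444, 9.444444, 3.333333,
                    17.222222, 14.444444, 22.222222, 20.0],
})

# One row per state, one column per statistic.
# "std" is the sample standard deviation (ddof=1), the pandas default.
state_stats = temp_demo.groupby("State")["Temperature"].agg(["mean", "std"])
print(state_stats)
```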


# [ONLY ONE MANDATORY]  Challenge 2

#### Load the `employees` dataset into a dataframe. Call the dataframe `employees`.

In [5]:
# your code here
employees = pd.read_csv("Employee.csv")

#### Explore the data types of the `employees` dataframe. Comment your results.

In [6]:
# your code here
employees.head()

     Name  Department  Education  Gender      Title  Years  Salary
0    Jose          IT   Bachelor       M    analyst      1      35
1   Maria          IT     Master       F    analyst      2      30
2   David          HR     Master       M    analyst      2      30
3   Sonia          HR   Bachelor       F    analyst      4      35
4  Samuel       Sales     Master       M  associate      3      55


In [7]:
employees.shape

(9, 7)

In [8]:
employees.describe()

          Years     Salary
count  9.000000   9.000000
mean   4.111111  48.888889
std    2.803767  16.541194
min    1.000000  30.000000
25%    2.000000  35.000000
50%    3.000000  55.000000
75%    7.000000  60.000000
max    8.000000  70.000000


In [None]:
"""
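9 employees (rows) described by 7 columns.
Years and Salary are numeric; the other five columns (Name, Department, Education, Gender, Title) hold text.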
"""

#### What's the average salary in this company?

In [10]:
# your code here
avg_salary = employees["Salary"].sum()/employees["Salary"].count()
avg_salary

48.888888888888886

#### What's the highest salary?

In [11]:
# your code here
employees["Salary"].max()

70

#### What's the lowest salary?

In [12]:
# your code here
employees["Salary"].min()

30

#### Who are the employees with the lowest salary?

In [16]:
# your code here
employees[employees["Salary"] == employees["Salary"].min()]

    Name  Department  Education  Gender    Title  Years  Salary
1  Maria          IT     Master       F  analyst      2      30
2  David          HR     Master       M  analyst      2      30


#### Find all the information about an employee called David.

In [26]:
# your code here

employees[employees["Name"] =="David"]

    Name  Department  Education  Gender    Title  Years  Salary
2  David          HR     Master       M  analyst      2      30


#### Could you return only David's salary?

In [27]:
# your code here
employees["Salary"][employees["Name"] =="David"]

2    30
Name: Salary, dtype: int64

#### Print all the rows where job title is associate.

In [28]:
# your code here
employees[employees["Title"]=="associate"]

     Name  Department  Education  Gender      Title  Years  Salary
4  Samuel       Sales     Master       M  associate      3      55
5     Eva       Sales   Bachelor       F  associate      2      55
7   Pedro          IT        Phd       M  associate      7      60


#### Print the first 3 rows of your dataframe.
**Tip**: There are 2 ways to do it. Do it both ways.

In [29]:
# Method 1
# your code here
employees.head(3)

    Name  Department  Education  Gender    Title  Years  Salary
0   Jose          IT   Bachelor       M  analyst      1      35
1  Maria          IT     Master       F  analyst      2      30
2  David          HR     Master       M  analyst      2      30


In [31]:
# Method 2
# your code here
employees[0:3]

    Name  Department  Education  Gender    Title  Years  Salary
0   Jose          IT   Bachelor       M  analyst      1      35
1  Maria          IT     Master       F  analyst      2      30
2  David          HR     Master       M  analyst      2      30


#### Find the employees whose title is associate and whose salary is above 55.

In [33]:
# your code here

employees[(employees["Title"]=="associate") & (employees["Salary"]>55)]

    Name  Department  Education  Gender      Title  Years  Salary
7  Pedro          IT        Phd       M  associate      7      60


#### Group the employees by number of years of employment. What are the average salaries in each group?

In [34]:
# your code here
employees.groupby("Years", as_index=False).agg({"Salary":"mean"})

   Years     Salary
0      1  35.000000
1      2  38.333333
2      3  55.000000
3      4  35.000000
4      7  60.000000
5      8  70.000000


####  What is the average salary per title?

In [35]:
# your code here
employees.groupby("Title", as_index=False).agg({"Salary":"mean"})

       Title     Salary
0         VP  70.000000
1    analyst  32.500000
2  associate  56.666667


####  Find the salary quartiles.


In [39]:
# your code here
q1 = np.percentile(employees.Salary, 25)
q2 = np.percentile(employees.Salary, 50)
q3 = np.percentile(employees.Salary, 75)

print(q1,q2,q3)

35.0 55.0 60.0
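
The same quartiles can be obtained in one call with `Series.quantile`, which accepts a list of fractions. The sketch below only uses the seven salaries that are visible in the outputs of this notebook (two of the nine rows are never printed), so its numbers differ from the result above; `salary_demo` is an illustrative name.

```python
import pandas as pd

# Salaries of the seven employees shown in the notebook outputs
# (Jose, Maria, David, Sonia, Samuel, Eva, Pedro). Two rows are not shown.
salary_demo = pd.Series([35, 30, 30, 35, 55, 55, 60], name="Salary")

# Linear interpolation between order statistics, the default for
# both Series.quantile and np.percentile.
quartiles = salary_demo.quantile([0.25, 0.5, 0.75])
print(quartiles)
```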


#### Is the mean salary different per gender?

In [40]:
# your code here
employees.groupby("Gender", as_index=False).agg({"Salary":"mean"})

  Gender  Salary
0      F    47.5
1      M    50.0


#### Find the minimum, mean and maximum of all numeric columns for each company department.



In [50]:
# your code here
salary_min = employees.groupby("Department", as_index=False).agg({"Salary":"min"})
salary_max = employees.groupby("Department", as_index=False).agg({"Salary":"max" })
salary_mean = employees.groupby("Department", as_index=False).agg({"Salary":"mean" })

years_min = employees.groupby("Department", as_index=False).agg({"Years":"min"})
years_max = employees.groupby("Department", as_index=False).agg({"Years":"max" })
years_mean = employees.groupby("Department", as_index=False).agg({"Years":"mean" })

print("Salary Min: ", salary_min,"\n\nSalary Max: ",salary_max,"\n\nSalary Mean: ", salary_mean,"\n\nYears min:", years_min,"\n\nYears max:", years_max,"\n\nYears mean: ",years_mean)


Salary Min:    Department  Salary
0         HR      30
1         IT      30
2      Sales      55 

Salary Max:    Department  Salary
0         HR      70
1         IT      70
2      Sales      55 

Salary Mean:    Department  Salary
0         HR   45.00
1         IT   48.75
2      Sales   55.00 

Years min:   Department  Years
0         HR      2
1         IT      1
2      Sales      2 

Years max:   Department  Years
0         HR      8
1         IT      8
2      Sales      3 

Years mean:    Department     Years
0         HR  4.666667
1         IT  4.500000
2      Sales  2.500000
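
The six separate aggregations above can be collapsed into a single call by selecting the numeric columns and passing a list of statistics to `agg`. A sketch on the five rows printed by `employees.head()` (so the numbers do not match the full nine-row dataset); `employees_demo` and `summary` are illustrative names.

```python
import pandas as pd

# Department, Years and Salary for the five rows shown by employees.head().
employees_demo = pd.DataFrame({
    "Name": ["Jose", "Maria", "David", "Sonia", "Samuel"],
    "Department": ["IT", "IT", "HR", "HR", "Sales"],
    "Years": [1, 2, 2, 4, 3],
    "Salary": [35, 30, 30, 35, 55],
})

# Result has one row per department and a two-level column index:
# (column name, statistic).
summary = (
    employees_demo
    .groupby("Department")[["Years", "Salary"]]
    .agg(["min", "mean", "max"])
)
print(summary)
```

A single value can then be read with a tuple, for example `summary.loc["IT", ("Salary", "mean")]`.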


#### Bonus Question:  for each department, compute the difference between the maximum and the minimum salary.
**Hint**: try using `agg` or `apply` combined with `lambda` functions.

In [66]:
# your code here
#department = pd.unique(employees["Department"])
#department

# Restrict to the numeric columns: subtracting text values (Name, Title, ...) can raise an error in recent pandas versions.
diff_department = employees.groupby("Department")[["Years", "Salary"]].agg(lambda x: x.max() - x.min())
diff_department

            Years  Salary
Department
HR              6      40
IT              7      40
Sales           1       0
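
Since the question only asks about salary, the aggregation can be limited to that one column, which returns a plain Series indexed by department. A sketch on the five rows printed by `employees.head()`, with `employees_demo` and `salary_range` as illustrative names:

```python
import pandas as pd

# Department and Salary for the five rows shown by employees.head().
employees_demo = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales"],
    "Salary": [35, 30, 30, 35, 55],
})

# For each department, the lambda receives that department's salaries
# as a Series and returns max minus min.
salary_range = (
    employees_demo
    .groupby("Department")["Salary"]
    .agg(lambda s: s.max() - s.min())
)
print(salary_range)
```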


# [ONLY ONE MANDATORY] Challenge 3
#### Open the `Orders` dataset. Name your dataset `orders`.

In [None]:
# your code here

#### Explore your dataset by looking at the data types and summary statistics. Comment your results.

In [None]:
# your code here

In [None]:
"""
your comments here
"""

####  What is the average purchase price?

In [None]:
# your code here

#### What are the highest and lowest purchase prices? 

In [None]:
# your code here

#### Select all the customers from Spain.
**Hint**: Remember that you are not asked to find orders from Spain but customers. A customer might have more than one order associated. 

In [None]:
# your code here
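
One way to read the hint: filter the rows by country first, then reduce them to distinct customers. The sketch below runs on a made-up table; the column names `CustomerID` and `Country` are assumptions and may not match the real `Orders` dataset.

```python
import pandas as pd

# Hypothetical orders table: customer 100 has two orders from Spain.
orders_demo = pd.DataFrame({
    "InvoiceNo": [1, 2, 3, 4],
    "CustomerID": [100, 100, 200, 300],
    "Country": ["Spain", "Spain", "France", "Spain"],
})

# Keep the Spanish rows, then drop repeated customer IDs.
spain_customers = orders_demo.loc[
    orders_demo["Country"] == "Spain", "CustomerID"
].unique()

# The number of distinct customers, not the number of orders.
n_spain_customers = len(spain_customers)
print(spain_customers, n_spain_customers)
```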

#### How many customers do we have in Spain?

In [None]:
# your code here

#### Select all the customers who have bought more than 50 items.
**Hint**: Remember that you are not asked to find orders with more than 50 items but customers who bought more than 50 items. A customer with two orders of 30 items each should appear in the selection.

In [None]:
# your code here
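
The hint amounts to summing quantities per customer before applying the threshold. A sketch on a made-up table, assuming columns named `CustomerID` and `Quantity` (the real dataset may use different names):

```python
import pandas as pd

# Hypothetical data: customer 100 has two orders of 30 items each.
orders_demo = pd.DataFrame({
    "CustomerID": [100, 100, 200, 300],
    "Quantity": [30, 30, 10, 55],
})

# Total items per customer across all of their orders.
items_per_customer = orders_demo.groupby("CustomerID")["Quantity"].sum()

# Customers whose total exceeds 50, even if no single order does.
big_buyers = items_per_customer[items_per_customer > 50]
print(big_buyers)
```

Filtering rows with `Quantity > 50` directly would miss customer 100, whose individual orders are each below the threshold.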

#### Select orders from Spain that include more than 50 items.

In [None]:
# your code here

#### Select all free orders.

In [None]:
# your code here

#### Select all orders whose description starts with `lunch bag`.
**Hint**: use string functions.

In [None]:
# your code here
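
The string functions from the hint are available through the `.str` accessor. A sketch using `str.startswith` on a made-up `Description` column (the column name and the lowercase spelling of the descriptions are assumptions):

```python
import pandas as pd

# Hypothetical descriptions, including a missing value.
orders_demo = pd.DataFrame({
    "Description": [
        "lunch bag red retrospot",
        "jumbo bag",
        "lunch bag cars blue",
        "red lunch bag",
        None,
    ],
})

# na=False treats missing descriptions as non-matches, so the mask
# contains only True/False and can be used for row selection.
mask = orders_demo["Description"].str.startswith("lunch bag", na=False)
lunch_bags = orders_demo[mask]
print(lunch_bags)
```

`str.contains("lunch bag")` would also match "red lunch bag", which does not start with the phrase.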

#### Select all `lunch bag` orders made in 2011.

In [None]:
# your code here

#### Show the frequency distribution of the amount spent in Spain.

In [None]:
# your code here

#### Select all orders made in the month of August.

In [None]:
# your code here

#### Find the number of orders made by each country in the month of August.
**Hint**: Use value_counts().

In [None]:
# your code here
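
A sketch of the `value_counts()` hint on a made-up table. The column names `month` and `Country` are assumptions, as is the idea that each row stands for one order:

```python
import pandas as pd

# Hypothetical orders with a numeric month column (8 = August).
orders_demo = pd.DataFrame({
    "month": [8, 8, 7, 8],
    "Country": ["Spain", "France", "Spain", "Spain"],
})

# Keep the August rows, then count how often each country appears.
august = orders_demo[orders_demo["month"] == 8]
orders_by_country = august["Country"].value_counts()
print(orders_by_country)
```

If the real data holds several rows per order, counting distinct order identifiers per country would be closer to the question than counting rows.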

#### What's the  average amount of money spent by country?

In [None]:
# your code here

#### What's the most expensive item?

In [None]:
# your code here

#### What is the average amount spent per year?

In [None]:
# your code here