# Subsetting and Descriptive Stats

## Before you start:
   - Remember that you just need to do one of the challenges.
   - Keep in mind that you need to use some of the functions you learned in the previous lessons.
   - All datasets are provided in IronHack's database.
   - Explain your code and outputs as thoroughly as you can.
   - Try your best to answer the questions and complete the tasks and most importantly: enjoy the process!
   
#### Import all the necessary libraries here:

In [33]:
import pandas as pd
import numpy as np

# [ONLY ONE MANDATORY] Challenge 1
#### In this challenge we will use the `Temp_States` dataset.

#### First import it into a dataframe called `temp`.

In [2]:
temp = pd.read_csv('Temp_States.csv')

#### Print `temp`.

In [5]:
temp

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
1,Albany,New York,9.444444
2,Buffalo,New York,3.333333
3,Hartford,Connecticut,17.222222
4,Bridgeport,Connecticut,14.444444
5,Treton,New Jersey,22.222222
6,Newark,New Jersey,20.0


#### Explore the data types of the `temp` dataframe. What types of data do we have? Comment on your result.

In [6]:
temp.dtypes

City            object
State           object
Temperature    float64
dtype: object

In [None]:
"""
We have two columns of categorical data (city name and state) and one column of numeric data
holding the temperature in each of those places.
"""

#### Select the rows where state is New York.

In [9]:
nys = temp.loc[(temp['State'] == 'New York')]
nys

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
1,Albany,New York,9.444444
2,Buffalo,New York,3.333333


#### What is the average temperature of cities in New York?

In [11]:
nys['Temperature'].mean()

10.74074074074074

#### Which states and cities have a temperature above 15 degrees Celsius?

In [12]:
temp.loc[(temp['Temperature'] > 15)]

Unnamed: 0,City,State,Temperature
0,NYC,New York,19.444444
3,Hartford,Connecticut,17.222222
5,Treton,New Jersey,22.222222
6,Newark,New Jersey,20.0


#### Now, return only the cities that have a temperature above 15 degrees Celsius.

In [16]:
temp.loc[temp['Temperature'] > 15, 'City']

0         NYC
3    Hartford
5      Treton
6      Newark
Name: City, dtype: object

#### Which cities have a temperature above 15 degrees Celsius and below 20 degrees Celsius?

**Hint**: First, write the condition. Then, select the rows.

In [17]:
temp.loc[(temp['Temperature'] > 15) & (temp['Temperature'] < 20), 'City']

0         NYC
3    Hartford
Name: City, dtype: object
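
The hint above (condition first, then selection) can be sketched as two explicit steps. The small frame below re-creates the printed `temp` table by hand so that the snippet runs without the CSV file. The names `temp_demo`, `in_range`, `cities_in_range`, and `same_mask` are illustrative, and `between(..., inclusive='neither')` assumes pandas 1.3 or later.

```python
import pandas as pd

# Hand-built copy of the table printed earlier, so the sketch does not need the CSV file.
temp_demo = pd.DataFrame({
    'City': ['NYC', 'Albany', 'Buffalo', 'Hartford', 'Bridgeport', 'Treton', 'Newark'],
    'State': ['New York', 'New York', 'New York', 'Connecticut', 'Connecticut',
              'New Jersey', 'New Jersey'],
    'Temperature': [19.444444, 9.444444, 3.333333, 17.222222, 14.444444, 22.222222, 20.0],
})

# Step 1: write the condition as a boolean mask.
in_range = (temp_demo['Temperature'] > 15) & (temp_demo['Temperature'] < 20)

# Step 2: select the matching rows and only the 'City' column in a single .loc call.
cities_in_range = temp_demo.loc[in_range, 'City']

# Equivalent mask written with between(); both bounds are excluded.
same_mask = temp_demo['Temperature'].between(15, 20, inclusive='neither')
```

Passing the mask and the column label to one `.loc` call avoids chained indexing, which matters once the selection is used for assignment.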

#### Find the mean and standard deviation of the temperature of each state.

In [30]:
mean_temp = temp.groupby('State', as_index=False).agg({'Temperature': 'mean'}).rename(columns={'Temperature': 'Average Temperature'})
mean_temp

Unnamed: 0,State,Average Temperature
0,Connecticut,15.833333
1,New Jersey,21.111111
2,New York,10.740741


In [31]:
std_temp = temp.groupby('State', as_index=False).agg({'Temperature': 'std'}).rename(columns={'Temperature': 'STD Temperature'})
std_temp

Unnamed: 0,State,STD Temperature
0,Connecticut,1.964186
1,New Jersey,1.571348
2,New York,8.133404
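
As a possible alternative to the two separate cells, both statistics can be requested in one `agg` call. This is only a sketch: `temp_demo` is a hand-built copy of the printed table and `state_stats` is an illustrative name.

```python
import pandas as pd

# Hand-built copy of the table printed earlier, so the sketch does not need the CSV file.
temp_demo = pd.DataFrame({
    'City': ['NYC', 'Albany', 'Buffalo', 'Hartford', 'Bridgeport', 'Treton', 'Newark'],
    'State': ['New York', 'New York', 'New York', 'Connecticut', 'Connecticut',
              'New Jersey', 'New Jersey'],
    'Temperature': [19.444444, 9.444444, 3.333333, 17.222222, 14.444444, 22.222222, 20.0],
})

# One groupby, two aggregations; the result is indexed by State.
state_stats = (
    temp_demo.groupby('State')['Temperature']
    .agg(['mean', 'std'])
    .rename(columns={'mean': 'Average Temperature', 'std': 'STD Temperature'})
)
```

Note that `std` here is the sample standard deviation (pandas defaults to `ddof=1`), which matches the values shown above.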


# [ONLY ONE MANDATORY] Challenge 2

#### Load the `employees` dataset into a dataframe. Call the dataframe `employees`.

In [41]:
employees = pd.read_csv('Employee.csv')
employees

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30
3,Sonia,HR,Bachelor,F,analyst,4,35
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
6,Carlos,IT,Master,M,VP,8,70
7,Pedro,IT,Phd,M,associate,7,60
8,Ana,HR,Master,F,VP,8,70


#### Explore the data types of the `employees` dataframe. Comment on your results.

In [37]:
employees.dtypes

Name          object
Department    object
Education     object
Gender        object
Title         object
Years          int64
Salary         int64
dtype: object

In [None]:
"""
The dataframe contains 5 columns of categorical data and 2 columns of numeric data.
"""

#### What's the average salary in this company?

In [38]:
employees['Salary'].mean()

48.888888888888886

#### What's the highest salary?

In [39]:
employees['Salary'].max()

70

#### What's the lowest salary?

In [40]:
employees['Salary'].min()

30

#### Who are the employees with the lowest salary?

In [45]:
employees.loc[employees['Salary'] == employees['Salary'].min(), 'Name']

1    Maria
2    David
Name: Name, dtype: object

#### Find all the information about an employee called David.

In [46]:
employees.loc[(employees['Name'] == 'David')]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
2,David,HR,Master,M,analyst,2,30


#### Could you return only David's salary?

In [47]:
employees.loc[employees['Name'] == 'David', 'Salary']

2    30
Name: Salary, dtype: int64

#### Print all the rows where job title is associate.

In [48]:
employees.loc[(employees['Title'] == 'associate')]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
7,Pedro,IT,Phd,M,associate,7,60


#### Print the first 3 rows of your dataframe.
**Tip**: There are 2 ways to do it. Do it both ways.

In [49]:
# Method 1
employees[:3]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


In [52]:
# Method 2
employees.iloc[:3]

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30


#### Find the employees whose title is associate and whose salary is above 55.

In [51]:
employees.loc[(employees['Title'] == 'associate') & (employees['Salary'] > 55), 'Name']

7    Pedro
Name: Name, dtype: object

#### Group the employees by number of years of employment. What are the average salaries in each group?

In [54]:
employees.groupby('Years', as_index=False).agg({'Salary': 'mean'}).rename(columns={'Salary': 'Average Salary'})

Unnamed: 0,Years,Average Salary
0,1,35.0
1,2,38.333333
2,3,55.0
3,4,35.0
4,7,60.0
5,8,70.0


#### What is the average salary per title?

In [55]:
employees.groupby('Title', as_index=False).agg({'Salary': 'mean'}).rename(columns={'Salary': 'Average Salary'})

Unnamed: 0,Title,Average Salary
0,VP,70.0
1,analyst,32.5
2,associate,56.666667


#### Find the salary quartiles.


In [60]:
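# describe() already contains the three quartiles under the labels '25%', '50%' and '75%'
stats = employees['Salary'].describe()
q1 = stats.loc['25%']
q2 = stats.loc['50%']  # second quartile, i.e. the median
q3 = stats.loc['75%']
q1, q2, q3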
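
A more direct route to the quartiles is `Series.quantile`. The sketch below copies the nine salaries from the printed `employees` table into a stand-alone Series; `salaries`, `quartiles`, and `iqr` are illustrative names.

```python
import pandas as pd

# Salaries copied from the employees table printed earlier.
salaries = pd.Series([35, 30, 30, 35, 55, 55, 70, 60, 70], name='Salary')

# Ask for the three quartiles at once; the result is indexed by the requested fractions.
quartiles = salaries.quantile([0.25, 0.5, 0.75])

# Interquartile range: distance between the third and first quartile.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
```

With pandas' default linear interpolation this gives 35, 55, and 60 for the three quartiles, so the interquartile range is 25.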

#### Is the mean salary different per gender?

In [56]:
employees.groupby('Gender', as_index=False).agg({'Salary': 'mean'}).rename(columns={'Salary': 'Average Salary'})

Unnamed: 0,Gender,Average Salary
0,F,47.5
1,M,50.0


#### Find the minimum, mean and maximum of all numeric columns for each company department.



In [64]:
employees.groupby('Department').describe()

Unnamed: 0_level_0,Years,Years,Years,Years,Years,Years,Years,Years,Salary,Salary,Salary,Salary,Salary,Salary,Salary,Salary
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Department,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
HR,3.0,4.666667,3.05505,2.0,3.0,4.0,6.0,8.0,3.0,45.0,21.794495,30.0,32.5,35.0,52.5,70.0
IT,4.0,4.5,3.511885,1.0,1.75,4.5,7.25,8.0,4.0,48.75,19.31105,30.0,33.75,47.5,62.5,70.0
Sales,2.0,2.5,0.707107,2.0,2.25,2.5,2.75,3.0,2.0,55.0,0.0,55.0,55.0,55.0,55.0,55.0
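
`describe()` returns more statistics than the question asks for. A narrower sketch, assuming only the minimum, mean, and maximum are wanted, passes a list of function names to `agg`. The frame `employees_demo` is a hand-built copy of the relevant columns and `dept_summary` is an illustrative name.

```python
import pandas as pd

# Relevant columns copied from the employees table printed earlier.
employees_demo = pd.DataFrame({
    'Name': ['Jose', 'Maria', 'David', 'Sonia', 'Samuel', 'Eva', 'Carlos', 'Pedro', 'Ana'],
    'Department': ['IT', 'IT', 'HR', 'HR', 'Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Years': [1, 2, 2, 4, 3, 2, 8, 7, 8],
    'Salary': [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# Select the numeric columns explicitly, then compute only the three requested statistics.
dept_summary = (
    employees_demo.groupby('Department')[['Years', 'Salary']]
    .agg(['min', 'mean', 'max'])
)
```

The result has two-level columns, so a single cell is addressed with a tuple, for example `dept_summary.loc['IT', ('Salary', 'mean')]`.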


#### Bonus question: for each department, compute the difference between the maximum and the minimum salary.
**Hint**: try using `agg` or `apply` combined with `lambda` functions.

In [None]:
# your code here
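
One way the hint could be followed is a `lambda` passed to `agg`. This is a sketch, not the only valid answer; `employees_demo` is a hand-built copy of the relevant columns and `salary_range` is an illustrative name.

```python
import pandas as pd

# Relevant columns copied from the employees table printed earlier.
employees_demo = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR', 'HR', 'Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Salary': [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# For each department, the lambda receives that department's salaries as a Series.
salary_range = (
    employees_demo.groupby('Department')['Salary']
    .agg(lambda s: s.max() - s.min())
)
```

For this table the range is 40 in HR and IT and 0 in Sales, where both employees earn 55.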

# [ONLY ONE MANDATORY] Challenge 3
#### Open the `Orders` dataset. Name your dataset `orders`.

In [65]:
orders = pd.read_csv('Orders.zip')
orders

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.30
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.00
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397919,541904,581587,22613,2011,12,5,12,pack of 20 spaceboy napkins,12,2011-12-09 12:50:00,0.85,12680,France,10.20
397920,541905,581587,22899,2011,12,5,12,children's apron dolly girl,6,2011-12-09 12:50:00,2.10,12680,France,12.60
397921,541906,581587,23254,2011,12,5,12,childrens cutlery dolly girl,4,2011-12-09 12:50:00,4.15,12680,France,16.60
397922,541907,581587,23255,2011,12,5,12,childrens cutlery circus parade,4,2011-12-09 12:50:00,4.15,12680,France,16.60


#### Explore your dataset by looking at the data types and summary statistics. Comment on your results.

In [67]:
orders.dtypes

Unnamed: 0        int64
InvoiceNo         int64
StockCode        object
year              int64
month             int64
day               int64
hour              int64
Description      object
Quantity          int64
InvoiceDate      object
UnitPrice       float64
CustomerID        int64
Country          object
amount_spent    float64
dtype: object

In [69]:
orders.describe()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,year,month,day,hour,Quantity,UnitPrice,CustomerID,amount_spent
count,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0
mean,278465.221859,560617.126645,2010.934259,7.612537,3.614555,12.728247,13.021823,3.116174,15294.315171,22.394749
std,152771.368303,13106.167695,0.247829,3.416527,1.928274,2.273535,180.42021,22.096788,1713.169877,309.055588
min,0.0,536365.0,2010.0,1.0,1.0,6.0,1.0,0.0,12346.0,0.0
25%,148333.75,549234.0,2011.0,5.0,2.0,11.0,2.0,1.25,13969.0,4.68
50%,284907.5,561893.0,2011.0,8.0,3.0,13.0,6.0,1.95,15159.0,11.8
75%,410079.25,572090.0,2011.0,11.0,5.0,14.0,12.0,3.75,16795.0,19.8
max,541908.0,581587.0,2011.0,12.0,7.0,20.0,80995.0,8142.75,18287.0,168469.6


In [None]:
"""
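The dataset has 397,924 rows. Ten columns are numeric (int64 or float64); StockCode, Description,
InvoiceDate and Country are stored as object, so InvoiceDate is text rather than a datetime.
Quantity and amount_spent are strongly right-skewed: their medians are 6 and 11.8, but their
maxima are 80995 and 168469.6. UnitPrice and amount_spent both have a minimum of 0.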
"""

#### What is the average purchase price?

In [70]:
orders['amount_spent'].mean()

22.39474850474768

#### What are the highest and lowest purchase prices? 

In [71]:
orders['amount_spent'].max()

168469.6

In [72]:
orders['amount_spent'].min()

0.0

#### Select all the customers from Spain.
**Hint**: Remember that you are not asked to find orders from Spain but customers. A customer might have more than one order associated. 

In [None]:
# your code here

#### How many customers do we have in Spain?

In [None]:
# your code here
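
The distinction between orders and customers in the hint can be sketched on a toy frame. The rows below are made up and only mimic the column names of `orders`; `orders_demo`, `spain_customers`, and `n_spain_customers` are illustrative names.

```python
import pandas as pd

# Made-up rows that only mimic the column names of `orders`; not real data.
orders_demo = pd.DataFrame({
    'InvoiceNo': [1, 1, 2, 3, 4],
    'CustomerID': [100, 100, 101, 102, 100],
    'Country': ['Spain', 'Spain', 'Spain', 'France', 'Spain'],
})

# Filter the rows first, then reduce to distinct customers.
is_spain = orders_demo['Country'] == 'Spain'
spain_customers = orders_demo.loc[is_spain, 'CustomerID'].unique()

# Counting customers means counting distinct IDs, not rows.
n_spain_customers = orders_demo.loc[is_spain, 'CustomerID'].nunique()
```

Customer 100 appears in three Spanish rows in this toy frame but is counted once, which is the behaviour the hint asks for.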

#### Select all the customers who have bought more than 50 items.
**Hint**: Remember that you are not asked to find orders with more than 50 items but customers who bought more than 50 items. A customer with two orders of 30 items each should appear in the selection.

In [None]:
# your code here
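
Because the threshold applies to each customer's total, a possible approach is to sum `Quantity` per `CustomerID` before filtering. The rows below are made up to reproduce the situation in the hint (two orders of 30 items each); all names ending in `_demo`, plus `items_per_customer` and `big_buyers`, are illustrative.

```python
import pandas as pd

# Made-up rows that only mimic the column names of `orders`; not real data.
orders_demo = pd.DataFrame({
    'InvoiceNo': [1, 2, 3, 4],
    'CustomerID': [100, 100, 101, 102],
    'Quantity': [30, 30, 5, 60],
})

# Total number of items bought by each customer across all of their orders.
items_per_customer = orders_demo.groupby('CustomerID')['Quantity'].sum()

# Keep only the customers whose total exceeds 50.
big_buyers = items_per_customer[items_per_customer > 50]
```

Customer 100 never bought more than 30 items in a single order but is selected because the two orders add up to 60.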

#### Select orders from Spain that include more than 50 items.

In [None]:
# your code here

#### Select all free orders.

In [None]:
# your code here

#### Select all orders whose description starts with `lunch bag`.
**Hint**: use string functions.

In [None]:
# your code here

#### Select all `lunch bag` orders made in 2011.

In [None]:
# your code here
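
The string-function hint most likely points at the `.str` accessor. A minimal sketch with `str.startswith` follows; the descriptions are invented for illustration (the real `Description` column appears to be lowercase), and `orders_demo`, `is_lunch_bag`, `lunch_bags`, and `lunch_bags_2011` are illustrative names.

```python
import pandas as pd

# Made-up rows that only mimic the column names of `orders`; not real data.
orders_demo = pd.DataFrame({
    'Description': [
        'lunch bag red retrospot',
        'white metal lantern',
        'lunch bag cars blue',
        'jumbo bag with lunch bag print',
    ],
    'year': [2010, 2011, 2011, 2011],
})

# True only where the description begins with the prefix, not where it merely contains it.
is_lunch_bag = orders_demo['Description'].str.startswith('lunch bag')
lunch_bags = orders_demo.loc[is_lunch_bag]

# Combine the text condition with a numeric one for the 2011 question.
lunch_bags_2011 = orders_demo.loc[is_lunch_bag & (orders_demo['year'] == 2011)]
```

The last row contains "lunch bag" but does not start with it, so it is excluded; `str.contains` would have kept it.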

#### Show the frequency distribution of the amount spent in Spain.

In [None]:
# your code here

#### Select all orders made in the month of August.

In [None]:
# your code here

#### Find the number of orders made by each country in the month of August.
**Hint**: Use `value_counts()`.

In [None]:
# your code here
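
A sketch of the `value_counts()` hint on a toy frame: filter on `month`, then count the rows per country. The rows are made up, and the names `orders_demo`, `august`, `rows_per_country`, and `invoices_per_country` are illustrative. Since each row of `orders` looks like one invoice line rather than one order, the second computation, which counts distinct `InvoiceNo` values, is offered as an assumption about what "number of orders" may mean.

```python
import pandas as pd

# Made-up rows that only mimic the column names of `orders`; not real data.
orders_demo = pd.DataFrame({
    'InvoiceNo': [10, 11, 10, 12, 13],
    'month': [8, 8, 8, 7, 8],
    'Country': ['Spain', 'France', 'Spain', 'Spain', 'Germany'],
})

# Keep only the rows from August (month == 8).
august = orders_demo.loc[orders_demo['month'] == 8]

# value_counts() counts rows per country, i.e. invoice lines.
rows_per_country = august['Country'].value_counts()

# Counting distinct invoices per country instead of lines.
invoices_per_country = august.groupby('Country')['InvoiceNo'].nunique()
```

In the toy data Spain has two August rows that belong to the same invoice, so the two counts differ (2 lines, 1 invoice).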

#### What's the average amount of money spent by country?

In [None]:
# your code here

#### What's the most expensive item?

In [None]:
# your code here

#### What is the average amount spent per year?

In [None]:
# your code here