In this challenge we walk you through a dataset on employees. We tackle the challenge step by step and give guidance on how to solve each problem.

## Import all libraries that are necessary

In [1]:
# your code here

import pandas as pd
import numpy as np


## Import and overview data

First import `Employee.csv` from the "subsetting" lab folder and print head to overview the data:

In [2]:
# your code here
df = pd.read_csv('Employee.csv')
df.head()

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30
3,Sonia,HR,Bachelor,F,analyst,4,35
4,Samuel,Sales,Master,M,associate,3,55


Printing the head is not a useless routine. You should really look at the data set and understand what it contains. No data analyst can successfully analyze data without an in-depth understanding of what each column is about. As we progress in this course, the data sets become increasingly complex, which requires you to inspect the data at the beginning and then as needed throughout the problem-solving process.
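Beyond `head()`, a few other calls give a quick overview. The following is a minimal sketch, not part of the original lab: it rebuilds a reduced copy of the data inline (four of the columns shown above) so that it runs without `Employee.csv`, and the variable names are only illustrative.

```python
import pandas as pd

# Reduced copy of the lab data (four of its columns), built inline
# so the sketch runs without Employee.csv.
df = pd.DataFrame({
    "Name": ["Jose", "Maria", "David", "Sonia", "Samuel", "Eva", "Carlos", "Pedro", "Ana"],
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Years": [1, 2, 2, 4, 3, 2, 8, 7, 8],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

n_rows, n_cols = df.shape                       # number of rows and columns
summary = df.describe()                         # summary statistics of the numeric columns
dept_counts = df["Department"].value_counts()   # number of employees per department

print(n_rows, n_cols)
print(summary)
print(dept_counts)
```
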


**Our next problem is: find the minimum, mean, and maximum of all numeric columns for each Department.**

We will solve this step by step.

## Main Problem - Setting Expectations

We want to break down the problem into several sub problems:

**Sub Problem 1 - How to extract numeric columns from a data set?**

**Sub Problem 2 - How to calculate minimum, mean, and maximum?**

**Sub Problem 3 - How to perform calculations for each Department?**

If we figure out each of the sub problems above, we have found the solution for our main problem.

Next let's tackle each sub problem.

## Main Problem - Collecting Information

This step is the problem-solving part of the main problem, in which we solve each of the three sub problems.

### Sub Problem 1

#### Setting Expectations

**Define problem: How to extract numeric columns from a data set?**

#### Collecting Information

This was already covered in a previous lesson by using `dtypes`. So let's print out all numeric columns:

In [3]:
df

Unnamed: 0,Name,Department,Education,Gender,Title,Years,Salary
0,Jose,IT,Bachelor,M,analyst,1,35
1,Maria,IT,Master,F,analyst,2,30
2,David,HR,Master,M,analyst,2,30
3,Sonia,HR,Bachelor,F,analyst,4,35
4,Samuel,Sales,Master,M,associate,3,55
5,Eva,Sales,Bachelor,F,associate,2,55
6,Carlos,IT,Master,M,VP,8,70
7,Pedro,IT,Phd,M,associate,7,60
8,Ana,HR,Master,F,VP,8,70


In [4]:
# enter your code here
df.dtypes

Name          object
Department    object
Education     object
Gender        object
Title         object
Years          int64
Salary         int64
dtype: object

You should have seen:
    
```
Name          object
Department    object
Education     object
Gender        object
Title         object
Years          int64
Salary         int64
dtype: object
```

#### Reacting to Data

You found `Years` and `Salary` are the numeric columns we need to extract. So we can potentially use a subset of our dataframe which contains only these two columns.

In [5]:
# your code here
df[['Years', 'Salary']]

Unnamed: 0,Years,Salary
0,1,35
1,2,30
2,2,30
3,4,35
4,3,55
5,2,55
6,8,70
7,7,60
8,8,70


But instead of hardcoding the column names in the solution, a better approach is to define a Python function that dynamically returns all numeric columns. You will be able to re-use this function in your future work. Also, if the data set is huge and contains hundreds of numeric columns, selecting them manually is impractical.

#### Revising Expectations

**Define new problem: How to *dynamically* extract numeric columns from a data set?**

#### Collecting Information

This was not covered in the lesson. So we need to [google the answer](https://www.google.com/search?q=pandas+dataframe+get+all+numeric+columns).

After finding the answer, write a function called `get_numeric_cols` below, which selects only the numeric columns from the dataframe.

In [6]:
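# Naive approach: compare each column's dtype against a hardcoded list

numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

def get_numeric_cols(all_columns):
    # Relies on the global `df`; the name `numeric_cols` avoids shadowing the built-in `list`
    numeric_cols = []
    for col in all_columns:
        if df[col].dtype in numeric_dtypes:
            numeric_cols.append(col)
    return df[numeric_cols]

get_numeric_cols(df.columns.tolist())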




Unnamed: 0,Years,Salary
0,1,35
1,2,30
2,2,30
3,4,35
4,3,55
5,2,55
6,8,70
7,7,60
8,8,70


In [7]:
#CORRECT SOLUTION
def get_numeric_cols(df):
    return df.select_dtypes(include=np.number)

get_numeric_cols(df)

Unnamed: 0,Years,Salary
0,1,35
1,2,30
2,2,30
3,4,35
4,3,55
5,2,55
6,8,70
7,7,60
8,8,70


In [8]:
#List of column names that are numeric
df.select_dtypes(include=np.number).columns.tolist()

['Years', 'Salary']

In [9]:
# presenting only the columns
df.select_dtypes(include=[np.number])

Unnamed: 0,Years,Salary
0,1,35
1,2,30
2,2,30
3,4,35
4,3,55
5,2,55
6,8,70
7,7,60
8,8,70


#### Reacting to Data

Now test your function:

In [10]:
# your code here
get_numeric_cols(df)

Unnamed: 0,Years,Salary
0,1,35
1,2,30
2,2,30
3,4,35
4,3,55
5,2,55
6,8,70
7,7,60
8,8,70


You should have seen:

```
   Years  Salary
0      1      35
1      2      30
2      2      30
3      4      35
4      3      55
5      2      55
6      8      70
7      7      60
8      8      70
```



Yes, this is exactly what we want!
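`select_dtypes` also accepts an `exclude` argument, which gives the complementary set of columns. The sketch below is an illustration only: `split_by_dtype` is a hypothetical helper name, and the small frame holds the first three rows of the lab data with four of its columns.

```python
import numpy as np
import pandas as pd

def split_by_dtype(frame):
    """Hypothetical helper: return (numeric, non_numeric) parts of a dataframe."""
    numeric = frame.select_dtypes(include=np.number)
    non_numeric = frame.select_dtypes(exclude=np.number)
    return numeric, non_numeric

# First three rows of the lab data, reduced to four columns.
sample = pd.DataFrame({
    "Name": ["Jose", "Maria", "David"],
    "Department": ["IT", "IT", "HR"],
    "Years": [1, 2, 2],
    "Salary": [35, 30, 30],
})

numeric, non_numeric = split_by_dtype(sample)
print(numeric.columns.tolist())
print(non_numeric.columns.tolist())
```
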

Now we move to the next sub problem.

### Sub Problem 2

#### Setting Expectations

**Define problem: How to calculate minimum, mean, and maximum?**

#### Collecting Information

Extract the numeric columns from the dataframe and then calculate min, max and mean on those.

In [11]:
# your code here
# Note: a notebook cell displays only its last expression, so only the min() result appears below
get_numeric_cols(df).mean()
get_numeric_cols(df).max()
get_numeric_cols(df).min()

Years      1
Salary    30
dtype: int64

After inspecting the output we find there is no revision required. So we move to the next sub problem.
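All three statistics can also be computed in a single call with `agg`, which avoids the problem of a notebook cell showing only its last expression. This is a minimal sketch on a reduced inline copy of the lab data; the variable name `stats` is only illustrative.

```python
import numpy as np
import pandas as pd

# Reduced inline copy of the lab data (department and numeric columns only).
df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Years": [1, 2, 2, 4, 3, 2, 8, 7, 8],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# One row per statistic, one column per numeric column.
stats = df.select_dtypes(include=np.number).agg(["min", "mean", "max"])
print(stats)
```
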

### Sub Problem 3

#### Setting Expectations

**Define problem: How to perform calculations for each Department?**

#### Collecting Information

What we need is to aggregate data by Department. Assign the aggregated data to a new variable called `employee_by_department`.

In [17]:
# your code here
employee_by_department=df.groupby('Department')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1119f0a30>

#### Reacting to Data

Test to calculate the means of each department:

In [13]:
# your code here
employee_by_department.mean(numeric_only=True)  # applied only to the numeric columns (pandas 2.0+ needs numeric_only=True)

Department,Years,Salary
HR,4.666667,45.0
IT,4.5,48.75
Sales,2.5,55.0


This is what we expect for this sub problem. Now we are ready to combine the solutions of all three sub problems in order to solve the main problem.
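The three sub problems can be combined by selecting the numeric columns on the grouped object and then aggregating. The following is a sketch under the same assumptions as before (a reduced inline copy of the data, illustrative variable names), not the lab's reference solution.

```python
import numpy as np
import pandas as pd

# Reduced inline copy of the lab data.
df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Years": [1, 2, 2, 4, 3, 2, 8, 7, 8],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

# Sub Problem 1: find the numeric columns dynamically.
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

# Sub Problems 2 and 3: min, mean, and max of those columns per department.
# The result has one row per department and (column, statistic) column pairs.
by_dept = df.groupby("Department")[numeric_cols].agg(["min", "mean", "max"])
print(by_dept)
```
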

## Main Problem - Reacting to Data / Revising Expectations

It turns out pandas can restrict `mean` to the numeric columns even if the data set contains non-numeric fields: older versions drop non-numeric columns automatically, while pandas 2.0 and later expect `numeric_only=True`. We can therefore revise our solution, because it is not strictly necessary to extract the numeric columns (Sub Problem 1) ourselves. In this case we simply combine the solutions for Sub Problems 2 and 3. Write your code below to compute the min, mean, and max by department.

In [14]:
# your code here
#employee_by_department.max()
employee_by_department.mean(numeric_only=True)
#employee_by_department.min()

Department,Years,Salary
HR,4.666667,45.0
IT,4.5,48.75
Sales,2.5,55.0


Alternatively, we can stick to our original solution that combines all three sub problems. This gives us more control over what we do with the data. What if the goal is not to compute MIN, MEAN, and MAX? What if the task is to apply a custom function you wrote, which cannot select the numeric columns automatically? It is therefore worth figuring out how to do this.
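As an illustration of the custom-function case, the sketch below computes the range (max minus min) of each numeric column, first overall and then per department. The range statistic is only an example chosen here, and the data is a reduced inline copy of the lab data.

```python
import numpy as np
import pandas as pd

# Reduced inline copy of the lab data.
df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Years": [1, 2, 2, 4, 3, 2, 8, 7, 8],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

# Custom statistic applied column by column to the whole data set.
overall_range = df[numeric_cols].apply(lambda col: col.max() - col.min())

# The same statistic applied column by column within each department.
range_by_dept = df.groupby("Department")[numeric_cols].agg(lambda col: col.max() - col.min())

print(overall_range)
print(range_by_dept)
```
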

Write your code below that uses one line of code to perform MIN/MEAN/MAX respectively while selecting the numeric columns.

*Hint: use `apply` and `lambda`*

In [18]:
df[df.select_dtypes(include=np.number).columns.tolist()].apply(lambda x: x.min())


Years      1
Salary    30
dtype: int64

In [26]:
# enter your code here
# Run one line at a time; the output below is for mean()
#df[df.select_dtypes(include=np.number).columns.tolist()].apply(lambda x: x.min())
#df[df.select_dtypes(include=np.number).columns.tolist()].apply(lambda x: x.max())
df[df.select_dtypes(include=np.number).columns.tolist()].apply(lambda x: x.mean())

Years      4.111111
Salary    48.888889
dtype: float64

Test your code and check whether you get output similar to the following:

```
PRINTING DEPARTMENT MIN:

Years      1
Salary    30
dtype: int64

---

PRINTING DEPARTMENT MEAN:

Years      4.111111
Salary    48.888889
dtype: float64

---

PRINTING DEPARTMENT MAX:

Years      8
Salary    70
dtype: int64
```

In [20]:
mins = df.groupby('Department').apply(lambda a : get_numeric_cols(a)).min()
mins

Years      1
Salary    30
dtype: int64

If you don't see the correct output, check your code and revise.
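Note that the expected output above reports values over the whole data set, even though the headings mention departments. If a breakdown per department is what you want, one possible sketch is the loop below. It assumes a reduced inline copy of the lab data, and the `results` dictionary is only an illustrative way to keep the three tables.

```python
import numpy as np
import pandas as pd

# Reduced inline copy of the lab data.
df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales", "Sales", "IT", "IT", "HR"],
    "Years": [1, 2, 2, 4, 3, 2, 8, 7, 8],
    "Salary": [35, 30, 30, 35, 55, 55, 70, 60, 70],
})

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
grouped = df.groupby("Department")[numeric_cols]

results = {}
for stat in ["min", "mean", "max"]:
    # agg accepts the statistic's name as a string.
    results[stat] = grouped.agg(stat)
    print(f"PRINTING DEPARTMENT {stat.upper()}:\n")
    print(results[stat])
    print("\n---\n")
```
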