# Python for Data Science Practice Session 1: Economics and Finance

# Employee Performance Analysis

The [Productivity Prediction of Garment Employees Data Set](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees) includes important attributes of the garment manufacturing process and the productivity of the employees. The data were collected manually and validated by industry experts.

In this notebook, we will assume that this dataset includes data from just one company. The company's management is interested in extracting some specific information about their employees' performance. The tasks we are going to work on are:
1. **Performance check** - get specified columns of a sample from the dataset
2. **Performance ranking** - filter and sort the dataset following specified rules
3. **Teams Ranking** - create a scoreboard for each team
4. **Lottery** - take a random sample out of rows that satisfy given conditions

If you are struggling with anything, check the **Tips** section at the end of the notebook. At the end of some tasks, you can find a number in parentheses that references the Tips section.

Let's get started!

The first step is to import the libraries that we will be using - in our case, `pandas`. Import it as `pd`, so that we can refer to it more easily later on.

In [1]:
#import....

Now we need to import the dataset. The dataset is available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00597/) (click on `garments_worker_productivity.csv`, and it should download automatically). Load it into a dataframe and save it as `all_data`. *(1)*

In [2]:
#all_data = ....
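
A minimal sketch of the loading step, kept self-contained by reading two rows of the dataset (with only a subset of its columns) from an in-memory string. `io.StringIO` stands in for the downloaded file and `demo_data` is a hypothetical name used only here; in the notebook you would pass the file name to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Two rows in the same comma-separated layout as the dataset (subset of columns).
csv_text = """date,department,team,targeted_productivity,wip,actual_productivity
1/1/2015,sweing,8,0.80,1108.0,0.940725
1/1/2015,finishing,1,0.75,,0.886500
"""

# io.StringIO stands in for the file on disk; with the real download you would
# call pd.read_csv("garments_worker_productivity.csv") instead.
demo_data = pd.read_csv(io.StringIO(csv_text))

print(demo_data.shape)          # (number of rows, number of columns)
print(demo_data["wip"].isna())  # the empty field is read as a missing value (NaN)
```

Note that the empty `wip` field in the second row becomes `NaN`, which is how missing values show up in the rest of the notebook.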

Now, check if you have correctly imported and saved `all_data`.

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.80,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.886500
2,1/1/2015,Quarter1,sweing,Thursday,11,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
3,1/1/2015,Quarter1,sweing,Thursday,12,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
4,1/1/2015,Quarter1,sweing,Thursday,6,0.80,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,3/11/2015,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,3/11/2015,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,3/11/2015,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889


First, to get a grasp of the dataset, check the number of rows and columns.

(1197, 15)

Any missing data could cause serious problems for the program. Inspect `all_data` to see the count of values in each column. Based on this information, you can see whether any values are missing.

Unnamed: 0,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
count,1197.0,1197.0,1197.0,691.0,1197.0,1197.0,1197.0,1197.0,1197.0,1197.0,1197.0
mean,6.426901,0.729632,15.062172,1190.465991,4567.460317,38.210526,0.730159,0.369256,0.150376,34.609858,0.735091
std,3.463963,0.097891,10.943219,1837.455001,3348.823563,160.182643,12.709757,3.268987,0.427848,22.197687,0.174488
min,1.0,0.07,2.9,7.0,0.0,0.0,0.0,0.0,0.0,2.0,0.233705
25%,3.0,0.7,3.94,774.5,1440.0,0.0,0.0,0.0,0.0,9.0,0.650307
50%,6.0,0.75,15.26,1039.0,3960.0,0.0,0.0,0.0,0.0,34.0,0.773333
75%,9.0,0.8,24.26,1252.5,6960.0,50.0,0.0,0.0,0.0,57.0,0.850253
max,12.0,0.8,54.56,23122.0,25920.0,3600.0,300.0,45.0,2.0,89.0,1.120437


As the row `count` tells us, some values are missing in the `wip` column - we need to keep that in mind.
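
The comparison between the number of rows and the `count` row can be sketched on a tiny, made-up frame (`demo` and its values are hypothetical). `isna().sum()` is shown as an alternative way to reach the same numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four wip values are missing.
demo = pd.DataFrame({"team": [1, 2, 3, 4],
                     "wip": [1108.0, np.nan, 968.0, np.nan]})

n_rows, n_cols = demo.shape             # .shape is a (rows, columns) tuple
counts = demo.describe().loc["count"]   # non-missing values per numeric column
missing = n_rows - counts               # rows minus count = missing values

print(missing)
print(demo.isna().sum())                # same information, counted directly
```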

## Performance check

Let's begin this part of the notebook by showing a random sample of four rows of our dataset.

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
554,2/1/2015,Quarter1,sweing,Sunday,12,0.75,15.26,1276.0,1440,45,0.0,0,0,35.0,0.750451
772,2/15/2015,Quarter3,sweing,Sunday,5,0.8,30.1,679.0,7140,0,0.0,0,0,59.5,0.722569
783,2/15/2015,Quarter3,finishing,Sunday,4,0.75,4.15,,2400,0,0.0,0,0,20.0,0.287042
764,2/14/2015,Quarter2,finishing,Saturday,6,0.8,2.9,,960,0,0.0,0,0,8.0,0.483333


As you can see, the sample includes a lot of different columns. To make working on the data easier, you can limit the displayed data using `loc[]`.

(Take for example `date`,`department`,`team`,`no_of_workers`,`targeted_productivity`,`actual_productivity`.)

Unnamed: 0,date,department,team,no_of_workers,targeted_productivity,actual_productivity
409,1/24/2015,finishing,4,12.0,0.75,0.651515
976,2/28/2015,finishing,12,9.0,0.8,0.590617
995,3/1/2015,sweing,2,58.0,0.7,0.683551
926,2/25/2015,finishing,10,10.0,0.7,0.845833
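
A small sketch of drawing a random sample and limiting it to chosen columns with `loc[]`. The frame `demo` and its values are made up for illustration, and `random_state=0` is an assumption added only to make the draw repeatable; leave it out to get a different sample on every run.

```python
import pandas as pd

# Hypothetical toy data with more columns than we want to display.
demo = pd.DataFrame({
    "date": ["1/1/2015", "1/1/2015", "1/3/2015", "1/3/2015", "1/4/2015", "1/5/2015"],
    "department": ["sweing", "finishing", "finishing", "sweing", "finishing", "sweing"],
    "team": [8, 1, 4, 11, 3, 6],
    "smv": [26.16, 3.94, 4.15, 11.41, 4.15, 25.90],
})

columns = ["date", "department", "team"]

# sample(n=4) picks four random rows; loc[:, columns] keeps all of those rows
# but only the listed columns.
picked = demo.sample(n=4, random_state=0).loc[:, columns]
print(picked)
```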


In case you find something concerning in the sample, you might want to save it for further analysis. Let's save it and call the variable `random_check`.

In [8]:
#random_check =

You can check whether you have saved the data correctly by showing your variable.

Unnamed: 0,date,department,team,no_of_workers,targeted_productivity,actual_productivity
879,2/22/2015,sweing,10,50.0,0.7,0.456875
135,1/8/2015,finishing,5,8.0,0.7,0.821354
525,1/31/2015,finishing,9,2.0,0.75,0.971867
809,2/17/2015,sweing,6,33.0,0.75,0.750621


If some values in your sample look concerning (too high or too low), you can take their mean to check whether there is really something to worry about. Let's say that the number of workers in your sample seems too low. Check if that's the case by taking the mean.

23.25
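
As a sketch, the four `no_of_workers` values from the sample shown above reproduce this result. The frame name `random_check_demo` is hypothetical, and your own random sample will usually contain different rows and therefore a different mean.

```python
import pandas as pd

# no_of_workers values copied from the four sample rows displayed above.
random_check_demo = pd.DataFrame({"team": [10, 5, 9, 6],
                                  "no_of_workers": [50.0, 8.0, 2.0, 33.0]})

sample_mean = random_check_demo["no_of_workers"].mean()
print(sample_mean)  # 23.25
```

For comparison, the `mean` row of the summary table earlier in the notebook reports about 34.6 workers for the whole dataset.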

## Performance Ranking

Let's begin once again by showing our dataset.

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.80,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.886500
2,1/1/2015,Quarter1,sweing,Thursday,11,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
3,1/1/2015,Quarter1,sweing,Thursday,12,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
4,1/1/2015,Quarter1,sweing,Thursday,6,0.80,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,3/11/2015,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,3/11/2015,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,3/11/2015,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889


Sometimes you are interested only in some particular rows. To focus only on the data that you are interested in, you can use `loc[]`. (Show only the rows where `actual_productivity` is lower than `targeted_productivity`)

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
11,1/1/2015,Quarter1,sweing,Thursday,10,0.75,19.31,578.0,6480,45,0.0,0,0,54.0,0.712205
12,1/1/2015,Quarter1,sweing,Thursday,5,0.80,11.41,668.0,3660,50,0.0,0,0,30.5,0.707046
14,1/1/2015,Quarter1,finishing,Thursday,8,0.75,2.90,,960,0,0.0,0,0,8.0,0.676667
15,1/1/2015,Quarter1,finishing,Thursday,4,0.75,3.94,,2160,0,0.0,0,0,18.0,0.593056
16,1/1/2015,Quarter1,finishing,Thursday,7,0.80,2.90,,960,0,0.0,0,0,8.0,0.540729
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,3/11/2015,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,3/11/2015,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,3/11/2015,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889


We will be working on this data later. To access it more easily, save it as `below_target`.

In [13]:
#below_target = 
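
Row filtering with `loc[]` can be sketched as follows, using four rows (and three columns) taken from the tables above. `demo`, `mask` and `below_demo` are hypothetical names chosen so they do not clash with the notebook's variables.

```python
import pandas as pd

# Rows 0, 1, 11 and 14 of the dataset, reduced to three columns.
demo = pd.DataFrame(
    {"team": [8, 1, 10, 8],
     "targeted_productivity": [0.80, 0.75, 0.75, 0.75],
     "actual_productivity": [0.940725, 0.886500, 0.712205, 0.676667]},
    index=[0, 1, 11, 14],
)

# Comparing two columns gives one True/False value per row ...
mask = demo["actual_productivity"] < demo["targeted_productivity"]

# ... and loc[] keeps only the rows where the mask is True.
below_demo = demo.loc[mask]
print(below_demo)
```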

Check if you have correctly saved the filtered data.

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
11,1/1/2015,Quarter1,sweing,Thursday,10,0.75,19.31,578.0,6480,45,0.0,0,0,54.0,0.712205
12,1/1/2015,Quarter1,sweing,Thursday,5,0.80,11.41,668.0,3660,50,0.0,0,0,30.5,0.707046
14,1/1/2015,Quarter1,finishing,Thursday,8,0.75,2.90,,960,0,0.0,0,0,8.0,0.676667
15,1/1/2015,Quarter1,finishing,Thursday,4,0.75,3.94,,2160,0,0.0,0,0,18.0,0.593056
16,1/1/2015,Quarter1,finishing,Thursday,7,0.80,2.90,,960,0,0.0,0,0,8.0,0.540729
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,3/11/2015,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,3/11/2015,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,3/11/2015,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889


If you want to filter the data once again, you can do it using `loc[]` (keep only rows with positive `wip`). *(2)*

In [15]:
#below_target =
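
A sketch of this second filter on four rows from the table above (only `team` and `wip` are kept; `below_demo` is a hypothetical name). It also illustrates a detail worth knowing: a comparison with a missing value is never true, so rows with missing `wip` are dropped by the same condition.

```python
import numpy as np
import pandas as pd

# Rows 11, 12, 14 and 15 of the dataset; the finishing rows have no wip value.
below_demo = pd.DataFrame(
    {"team": [10, 5, 8, 4],
     "wip": [578.0, 668.0, np.nan, np.nan]},
    index=[11, 12, 14, 15],
)

# NaN > 0 evaluates to False, so only rows with a positive wip survive.
below_demo = below_demo.loc[below_demo["wip"] > 0]
print(below_demo)
```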

Once you have filtered your data, you can sort it by a chosen column. You can also combine sorting with `.head()` or `.tail()` to see only a specified number of 'best' or 'worst' rows. (Sort the data by `wip` and show the 25 best rows.) *(3)*

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
919,2/24/2015,Quarter4,sweing,Tuesday,10,0.7,21.25,1834.0,6360,0,0.0,0,1,53.0,0.471108
879,2/22/2015,Quarter4,sweing,Sunday,10,0.7,21.25,1531.0,6000,0,0.0,0,1,50.0,0.456875
898,2/23/2015,Quarter4,sweing,Monday,10,0.7,21.25,1583.0,6000,0,0.0,0,1,50.0,0.417917
558,2/1/2015,Quarter1,sweing,Sunday,8,0.6,24.26,1196.0,6600,0,0.0,0,0,55.0,0.466821
579,2/2/2015,Quarter1,sweing,Monday,8,0.65,24.26,1435.0,6600,0,0.0,0,0,55.0,0.260979


## Teams Ranking

Now, using the commands from the previous sections, take the `team`, `targeted_productivity`, `actual_productivity`, `over_time` and `wip` columns of `all_data`, and save them as `teams`. Make a copy, not a reference.

In [17]:
#teams =

As we remember, our dataset has some missing values in the `wip` column. To prevent any errors, fill them with zeros. *(4)*

In [18]:
#teams['wip'] =
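
A sketch of the copy-then-fill pattern on a made-up frame (`demo` and `teams_demo` are hypothetical names). Calling `.copy()` gives an independent dataframe, so filling the missing values does not touch the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing wip value.
demo = pd.DataFrame({"team": [1, 2, 3],
                     "smv": [26.16, 3.94, 11.41],
                     "wip": [1108.0, np.nan, 968.0]})

# Select the columns of interest and make an independent copy.
teams_demo = demo.loc[:, ["team", "wip"]].copy()

# Replace missing values with 0 and assign the result back to the column.
teams_demo["wip"] = teams_demo["wip"].fillna(0)

print(teams_demo)
print(demo["wip"].isna().sum())  # the original frame still has its missing value
```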

Now use the command that you have used at the beginning to check if you have correctly filled missing values.

Unnamed: 0,team,targeted_productivity,actual_productivity,over_time,wip
count,1197.0,1197.0,1197.0,1197.0,1197.0
mean,6.426901,0.729632,0.735091,4567.460317,687.22807
std,3.463963,0.097891,0.174488,3348.823563,1514.582341
min,1.0,0.07,0.233705,0.0,0.0
25%,3.0,0.7,0.650307,1440.0,0.0
50%,6.0,0.75,0.773333,3960.0,586.0
75%,9.0,0.8,0.850253,6960.0,1083.0
max,12.0,0.8,1.120437,25920.0,23122.0


If you want to aggregate values by a chosen category (for example, the mean value of each column for each team), you can use `groupby`. Save the result of grouping as a new dataframe named `teams_performance`. *(5)*

In [20]:
#teams_performance =
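
Grouping can be sketched on a tiny, made-up frame with two teams (`demo` and `per_team` are hypothetical names). After `groupby("team").mean()`, the team number becomes the index and every other column holds the mean for that team.

```python
import pandas as pd

# Hypothetical toy data: two rows for each of two teams.
demo = pd.DataFrame({"team": [1, 1, 2, 2],
                     "actual_productivity": [0.8, 0.9, 0.6, 0.7],
                     "over_time": [1000, 3000, 2000, 4000]})

# One row per team, each column averaged within the team.
per_team = demo.groupby("team").mean()
print(per_team)
```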

Let's check if you have grouped the data correctly. Show `teams_performance`.

team,targeted_productivity,actual_productivity,over_time,wip
1,0.746667,0.821054,4793.428571,858.238095
2,0.739908,0.770855,4384.954128,693.559633
3,0.742105,0.80388,5375.684211,860.410526
4,0.717619,0.770035,5449.714286,684.780952
5,0.673656,0.697981,5330.967742,482.548387
6,0.731383,0.685385,3369.095745,587.840426
7,0.714271,0.668006,4857.1875,572.635417
8,0.708257,0.674148,4312.293578,505.733945
9,0.758173,0.734462,4519.038462,715.923077
10,0.7385,0.719736,4736.7,871.15


Because `actual_productivity` on its own says nothing about how demanding each team's target was, create a new column called `relative_productivity`, which is equal to `actual_productivity` divided by `targeted_productivity`.

In [22]:
#teams_performance['relative_productivity'] =
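
As a sketch, the division can be tried on the per-team means of teams 1 and 7 from the table above (`perf_demo` is a hypothetical name). Dividing one column by another works row by row.

```python
import pandas as pd

# Per-team means of teams 1 and 7, copied from the grouped table above.
perf_demo = pd.DataFrame(
    {"targeted_productivity": [0.746667, 0.714271],
     "actual_productivity": [0.821054, 0.668006]},
    index=pd.Index([1, 7], name="team"),
)

# Element-wise division: values above 1 mean the team beat its target on average.
perf_demo["relative_productivity"] = (
    perf_demo["actual_productivity"] / perf_demo["targeted_productivity"]
)
print(perf_demo)
```

Team 1 ends up at roughly 1.10 and team 7 at roughly 0.94, which agrees with the maximum and minimum reported in the summary that follows.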

To get a grasp of the relative productivity distribution, show the minimum and maximum of relative productivity. *(6)*

Unnamed: 0,targeted_productivity,actual_productivity,over_time,wip,relative_productivity
count,12.0,12.0,12.0,12.0,12.0
mean,0.729063,0.733882,4565.791126,686.064163,1.00636
std,0.027073,0.053645,695.892459,133.914372,0.058672
min,0.673656,0.668006,3317.929293,482.548387,0.935227
25%,0.712767,0.684535,4334.948394,584.039173,0.964505
50%,0.734941,0.727099,4627.869231,689.170293,0.990404
75%,0.743246,0.772905,4975.63256,777.703463,1.049629
max,0.774242,0.821054,5449.714286,871.15,1.099626


Now, to create a simple scoring system, follow the commands listed below:

Create `rel_min` and `rel_max`, which are respectively the minimum and maximum values of the `relative_productivity` column. *(7)*

In [24]:
#rel_min =

In [25]:
#rel_max =

Do the same for the `over_time` and `wip` columns (name them `time_min`, `time_max`, `wip_min` and `wip_max`).

In [26]:
#time_min =

In [27]:
#time_max =

In [28]:
#wip_min =

In [29]:
#wip_max =
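
Picking single numbers out of the summary table can be sketched as below, on made-up values (`perf_demo` and the `*_demo` names are hypothetical). `loc[row, column]` on the result of `.describe()` returns one value; calling `.min()` or `.max()` on the column directly is an equivalent alternative.

```python
import pandas as pd

# Hypothetical per-team values.
perf_demo = pd.DataFrame(
    {"relative_productivity": [1.10, 0.94, 1.03],
     "over_time": [4793.0, 4857.0, 5375.0]},
    index=pd.Index([1, 7, 3], name="team"),
)

summary = perf_demo.describe()

# Row label first, column label second.
rel_min_demo = summary.loc["min", "relative_productivity"]
rel_max_demo = summary.loc["max", "relative_productivity"]

# The same numbers without going through describe().
time_min_demo = perf_demo["over_time"].min()
time_max_demo = perf_demo["over_time"].max()

print(rel_min_demo, rel_max_demo, time_min_demo, time_max_demo)
```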

Create a new empty dataframe called `score_board` with the same index as `teams_performance`. *(8)*

In [30]:
#score_board =
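
One possible way to build such a frame, sketched on made-up data (`perf_demo` and `score_demo` are hypothetical names): pass the existing index to the `pd.DataFrame` constructor and give it no columns.

```python
import pandas as pd

# Hypothetical per-team frame whose index we want to reuse.
perf_demo = pd.DataFrame({"over_time": [4793.0, 4385.0, 5376.0]},
                         index=pd.Index([1, 2, 3], name="team"))

# No data and no columns, only the row labels.
score_demo = pd.DataFrame(index=perf_demo.index)

print(score_demo.shape)  # (3, 0): three rows, zero columns
print(score_demo.index)
```

Because both frames share the same index, columns computed from the first frame line up row by row when they are later assigned to the second.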

Create the columns `performance_points`, `overtime_points` and `wip_penalty` using the given formula:

$$ points = \frac{(value - min\_value) \cdot max\_points}{max\_value - min\_value} $$

* min_value - minimum value of given column
* max_value - maximum value of given column
* max_points - points for a maximum score (100 for `performance_points`, 50 for `overtime_points` and -30 for `wip_penalty`)

In [4]:
#score_board['performance_points'] =

In [2]:
#score_board['overtime_points'] =

In [33]:
#score_board['wip_penalty'] =
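
The formula can be tried out on made-up numbers first. The helper `scale_points` below is hypothetical (the notebook computes each column directly from the saved minimum and maximum values); it only shows that the smallest value maps to 0 points and the largest to `max_points`.

```python
import pandas as pd


def scale_points(values: pd.Series, max_points: float) -> pd.Series:
    """Apply points = (value - min) * max_points / (max - min) to a column."""
    low, high = values.min(), values.max()
    # Assumes high != low; a constant column would divide by zero.
    return (values - low) * max_points / (high - low)


# Hypothetical values for three teams.
demo_values = pd.Series([2.0, 3.0, 4.0], index=pd.Index([1, 2, 3], name="team"))

reward = scale_points(demo_values, 100)   # 0 for the lowest, 100 for the highest
penalty = scale_points(demo_values, -30)  # 0 for the lowest, -30 for the highest

print(reward)
print(penalty)
```

With a negative `max_points`, the lowest value can be displayed as `-0.0`, which is equal to zero.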

Now round all the numbers in `score_board` to integers. *(9)*

In [34]:
#score_board =

Add up all scores in a new column called `total_score`.

In [35]:
#score_board['total_score']=
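
Rounding and row-wise summing can be sketched as follows. The unrounded values are made up (`score_demo` is a hypothetical name), chosen so that they round to the scores of teams 1 and 4 in the table that follows.

```python
import pandas as pd

# Hypothetical unrounded scores for two teams.
score_demo = pd.DataFrame(
    {"performance_points": [99.6, 84.2],
     "overtime_points": [34.7, 50.0],
     "wip_penalty": [-29.2, -16.4]},
    index=pd.Index([1, 4], name="team"),
)

# round() with no argument rounds every value to zero decimal places.
score_demo = score_demo.round()

# axis=1 adds up the values across the columns of each row.
score_demo["total_score"] = score_demo.sum(axis=1)
print(score_demo)
```

The values stay stored as floats, which is why the scores are displayed as `106.0` rather than `106`.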

Show the `score_board` to check if everything is alright.

team,performance_points,overtime_points,wip_penalty,total_score
1,100.0,35.0,-29.0,106.0
2,65.0,25.0,-16.0,74.0
3,90.0,48.0,-29.0,109.0
4,84.0,50.0,-16.0,118.0
5,61.0,47.0,-0.0,108.0
6,1.0,1.0,-8.0,-6.0
7,0.0,36.0,-7.0,29.0
8,10.0,23.0,-2.0,31.0
9,20.0,28.0,-18.0,30.0
10,24.0,33.0,-30.0,27.0


Once you have finished your scoring system, you can show the 3 worst teams ranked by `total_score`.

team,performance_points,overtime_points,wip_penalty,total_score
6,1.0,1.0,-8.0,-6.0
12,43.0,0.0,-21.0,22.0
10,24.0,33.0,-30.0,27.0


You can also easily show the 3 best teams.

team,performance_points,overtime_points,wip_penalty,total_score
4,84.0,50.0,-16.0,118.0
3,90.0,48.0,-29.0,109.0
5,61.0,47.0,-0.0,108.0
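
Both rankings can be sketched with `sort_values` and `.head()`, here on the `total_score` values of seven teams taken from the tables above (`score_demo`, `worst_three` and `best_three` are hypothetical names).

```python
import pandas as pd

# total_score of teams 1, 3, 4, 5, 6, 10 and 12, copied from the tables above.
score_demo = pd.DataFrame(
    {"total_score": [106.0, 109.0, 118.0, 108.0, -6.0, 27.0, 22.0]},
    index=pd.Index([1, 3, 4, 5, 6, 10, 12], name="team"),
)

# Ascending order puts the lowest scores first.
worst_three = score_demo.sort_values("total_score").head(3)

# Descending order puts the highest scores first.
best_three = score_demo.sort_values("total_score", ascending=False).head(3)

print(worst_three)
print(best_three)
```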


## Lottery

Let's say that the company is running a lottery among the rows with an `actual_productivity` of more than 0.95. Filter the rows following this rule.

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
19,1/3/2015,Quarter1,finishing,Saturday,4,0.80,4.15,,6600,0,0.0,0,0,20.0,0.988025
20,1/3/2015,Quarter1,finishing,Saturday,11,0.75,2.90,,5640,0,0.0,0,0,17.0,0.987880
21,1/3/2015,Quarter1,finishing,Saturday,9,0.80,4.15,,960,0,0.0,0,0,8.0,0.956271
40,1/4/2015,Quarter1,finishing,Sunday,3,0.75,4.15,,1560,0,0.0,0,0,8.0,0.991389
61,1/5/2015,Quarter1,finishing,Monday,1,0.80,3.94,,1920,0,0.0,0,0,8.0,0.961059
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1025,3/3/2015,Quarter1,finishing,Tuesday,7,0.80,4.60,,4200,0,0.0,0,0,10.0,0.999533
1068,3/5/2015,Quarter1,finishing,Thursday,8,0.80,4.60,,2640,0,0.0,0,0,22.0,0.980985
1069,3/5/2015,Quarter1,finishing,Thursday,2,0.60,3.90,,960,0,0.0,0,0,8.0,0.950625
1106,3/8/2015,Quarter2,finishing,Sunday,3,0.80,4.60,,1440,0,0.0,0,0,12.0,0.951944


Unfortunately, the lottery has only 3 winners, so you need to draw 3 random rows from the filtered data. Also, the lottery organisers are interested only in the columns `date`, `team` and `actual_productivity`.

Unnamed: 0,date,team,actual_productivity
599,2/4/2015,2,1.050281
602,2/4/2015,2,0.966759
478,1/28/2015,3,1.00023


You also need to save the results so that you can easily send them over to lottery organisers (save them as `lottery_results.csv`).
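
Saving can be sketched with `to_csv`, using the three result rows shown above (`winners_demo` is a hypothetical name). Writing into a temporary folder is a choice made for this example so that it leaves no files behind; in the notebook you would pass `"lottery_results.csv"` directly.

```python
import os
import tempfile

import pandas as pd

# The three rows from the lottery result shown above.
winners_demo = pd.DataFrame(
    {"date": ["2/4/2015", "2/4/2015", "1/28/2015"],
     "team": [2, 2, 3],
     "actual_productivity": [1.050281, 0.966759, 1.00023]},
    index=[599, 602, 478],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lottery_results.csv")

    # By default the index (the original row labels) is written as the first column.
    winners_demo.to_csv(path)

    # Reading the file back shows what the organisers would receive.
    read_back = pd.read_csv(path, index_col=0)

print(read_back)
```

Passing `index=False` to `to_csv` would leave the row labels out of the file, if the organisers do not need them.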

# Tips:
* (1) Make sure that the dataset is **in the same folder** as the notebook. The data are separated by **commas**. -> [help](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
* (2) Use the format: *dataframe = dataframe.loc[condition]*
* (3) -> [help](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
* (4) Use the format: *dataframe['column_name'] = dataframe['column_name'].fillna(value)*. -> [help](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
* (5) Use the format: *new_dataframe = dataframe.groupby().mean()*. -> [help](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
* (6) Use `.describe()`
* (7) Combine `.describe()` with `loc[]`
* (8) -> [help](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
* (9) -> [help](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html)