(Don't let the length of this notebook intimidate you. Most of the length comes from the large amount of retrieved data and the code itself is short and easy to follow)

# Table of Contents

- Load CSV Data
- A Word About Accessing Columns
- Selecting Data
  - Default Selection Mechanism
  - Selecting Rows by Index and Columns By Name
  - Selecting Rows and Columns By Position
  - Extracting Head/Tail Rows
- Filtering
  - Basic Idea of Filtering: Masking
  - Combining Multiple Conditions
  - Selecting Columns While Filtering
- Operating on Data
  - Finding Unique Values
  - View vs Copy
  - Adding a Column to a Data Frame
  - Finding Counts of Different Values
- Grouping & Aggregation
  - Mean/Standard Deviation/Count/Etc
  - Grouping By Ranges of Values
- Handling Missing Data
  - Remove Rows With Missing Data
  - Filling Empty Values
- Plotting
  - 2D Plot
  - Bar Chart
  - Pie Chart
  - Multiple Graphs
  - Other Plot Types
- Summary


Following up on [ML Bootcamp: Intro to NumPy](https://www.kaggle.com/rafidka/ml-bootcamp-intro-to-numpy), the next important step in preparing to embark on a machine learning task is data manipulation. Machine learning deals with huge amounts of data, and without proper methods for manipulating that data it is hard to extract knowledge from it.

In this notebook, I will be using the [Stack Overflow Developer Survey for 2019](https://www.kaggle.com/mchirico/stack-overflow-developer-survey-results-2019). This way, I hope, the reader will gain some insight into the Stack Overflow Developer Survey while learning pandas.

# Load CSV Data

Obviously, the first step in processing data is to load it. Pandas supports [multiple file formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), including CSV, which we will be using here.

In [None]:
import numpy as np
import pandas as pd

complete_survey = pd.read_csv("../input/stack-overflow-developer-survey-results-2019/survey_results_public.csv")
complete_survey_schema = pd.read_csv("../input/stack-overflow-developer-survey-results-2019/survey_results_schema.csv")

In [None]:
# Let's get a feeling of what the data looks like.
print(f"(rows x columns) = {complete_survey.shape}")
complete_survey

Since the dataset has many columns and rows, the notebook only shows part of them. First, let's print all the columns to see what data is available to us. We could use the pandas DataFrame's [columns](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html) field, but this only shows the column names and we would have to guess what they mean. Luckily, the [survey](https://insights.stackoverflow.com/survey) also provides the schema of the table, which we loaded above into the `complete_survey_schema` variable.

In [None]:
from IPython.core.display import HTML

pd.set_option('display.max_rows', 100) # Ensure that we see all results.
pd.set_option('display.max_colwidth', None) # Ensure that we display the complete text description (None means no limit; -1 is rejected by recent pandas versions).
complete_survey_schema

Looking at the available columns, I am interested in studying the following columns:

- MainBranch
- Hobbyist
- OpenSourcer
- Employment
- Country
- Student
- EdLevel
- UndergradMajor
- DevType
- YearsCode
- Age1stCode
- YearsCodePro
- ConvertedComp
- LanguageWorkedWith
- Age
- Gender

In [None]:
survey = complete_survey[[
    'MainBranch',
    'Hobbyist',
    'OpenSourcer',
    'Employment',
    'Country',
    'Student',
    'EdLevel',
    'UndergradMajor',
    'DevType',
    'YearsCode',
    'Age1stCode',
    'YearsCodePro',
    'ConvertedComp',
    'LanguageWorkedWith',
    'Age',
    'Gender'
]]
survey

# A Word About Accessing Columns

The syntax above for selecting some of the columns might look a little strange at first. What are the double square brackets `[[` and `]]`? This section explains them.

To access a single column in Pandas we simply use the column's name:

In [None]:
survey['Hobbyist']

To access multiple columns, we need to pass a list containing the names of the columns we want, which explains the syntax above: the outer brackets are the selection operator and the inner brackets are the list. To make it clearer, we could instead type:

```python
survey = complete_survey[ [
    'MainBranch',
    'Hobbyist',
    # ...
] ]
```

We could also assign the column names to a variable:

```python
columns = [
    'MainBranch',
    'Hobbyist',
    # ...
]
survey = complete_survey[columns]
```

This is a matter of personal preference, but I like to avoid unnecessary variables and keep my code as compact as possible.

# Selecting Data

Let's see various ways we can select data from a Pandas data frame.


## Default Selection Mechanism

We can use the square brackets operator `[` and `]` on a data frame to select data. 

In [None]:
# Select rows 0 to 4.
survey[0:5]

In [None]:
# Select Country column
survey['Country']

Notice that the above code returns a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) object instead of a DataFrame. The difference is that a series is a linear set of values, instead of two dimensional like a data frame. Essentially, a data frame is a set of multiple data series sharing the same index, each representing a different column.

If you want to get a data frame back instead, pass the column(s) as a list:

In [None]:
survey[ ['Country'] ]

To select multiple rows and columns, you can apply the selectors above one after the other, for example:

In [None]:
# Extract the first 10 rows, then keep only the Country, Student and EdLevel columns.
survey[0:10][ ['Country', 'Student', 'EdLevel' ] ]


## Selecting Rows by Index and Columns By Name

Two things distinguish Pandas data frames from NumPy 2-dimensional arrays:

1. A Pandas data frame has an index that identifies its rows. In NumPy, rows and columns are always specified by a numerical position between 0 and the length of the dimension - 1. With pandas, however, the index can be anything, numerical or non-numerical, in no particular order. The benefit of this is that a row keeps its index label, so even after applying a filter that removes some of the rows, the same row is still identified by the same label.

2. A Pandas data frame has columns with specific names. For example, in the data frame above for the Stack Overflow survey, we don't have to remember the numerical position of the column containing the country of the respondent; we simply use the `Country` column name.

To use index labels and column names for selecting data, we can use the `loc` indexer.


In [None]:
# Select individual rows by index.
survey.loc[ [1, 2, 5] ]

In [None]:
# Select individual columns by name.
survey.loc[ :, ['MainBranch', 'Country'] ]

In [None]:
# Select specific rows by index and specific columns by name.
survey.loc[ [1, 3, 10], ['MainBranch', 'Country'] ]

In [None]:
# Select all columns between MainBranch and Country
survey.loc[:, 'MainBranch':'Country']

One thing to notice is that the indexes for rows happened to be numbers here, but they don't need to be so. For example, let's take the first 5 rows and change the index to use alphabetical letters.

In [None]:
survey_new_index = survey.head(5)
survey_new_index.index = ['a', 'b', 'c', 'd', 'e']
survey_new_index

Having this new data frame, let's see how we can extract rows:

In [None]:
survey_new_index.loc['b':'d']

## Selecting Rows and Columns By Position

If you want to select data from a data frame NumPy-style, by specifying the numerical position of the row(s)/column(s), you can use the `iloc` indexer.

In [None]:
# Select the first row
survey.iloc[0]

In [None]:
# Select the first column
survey.iloc[:,0]

In [None]:
# Select the cell in the 4th row and 5th column
survey.iloc[3, 4]
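
One difference between `loc` and `iloc` is easy to miss: a label slice passed to `loc` includes both endpoints, while a position slice passed to `iloc` excludes the stop position, as in NumPy. A minimal sketch on a hypothetical five-row data frame:

```python
import pandas as pd

# Hypothetical toy data with a non-numeric index.
toy = pd.DataFrame(
    {"Country": ["Canada", "Japan", "Italy", "Germany", "Russia"]},
    index=["a", "b", "c", "d", "e"],
)

by_label = toy.loc["b":"d"]    # label slice: rows b, c AND d
by_position = toy.iloc[1:3]    # position slice: positions 1 and 2 only

print(by_label.index.tolist())     # ['b', 'c', 'd']
print(by_position.index.tolist())  # ['b', 'c']
```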

## Extracting Head/Tail Rows

It is sometimes useful to extract the first or last few rows. For this we can use the `head` and `tail` methods:

In [None]:
survey.head(3)

In [None]:
survey.tail(3)

# Filtering

We can filter down the data in a data frame by applying operators on the values of the columns. For example, we might want to extract all rows where the `Country` column is `Canada`, or the salary is less/higher than a certain value. 

## Basic Idea of Filtering: Masking

Let's start with an example that explains the basic idea of how filtering works in pandas. Let's extract responses coming from Canada:

In [None]:
survey[survey['Country'] == 'Canada']

At first, the syntax of the code above might seem hard to understand; why do we need to pass `survey['Country'] == 'Canada'` as an operand to `survey[]` itself? To understand, let's evaluate what is inside the square brackets:

In [None]:
survey['Country'] == 'Canada'

As you can see, it returns a series of True/False values, one per row. Passing this series to `survey[ ]` acts as a mask, specifying which rows to extract and which ones to ignore.
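
The same two steps can be seen on a small, made-up data frame (the `toy` data below is hypothetical), where the mask is short enough to read in full:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Country": ["Canada", "Japan", "Canada"],
    "Age": [30, 41, 25],
})

mask = toy["Country"] == "Canada"   # boolean Series aligned with toy's index
selected = toy[mask]                # keeps only the rows where the mask is True

print(mask.tolist())            # [True, False, True]
print(selected.index.tolist())  # [0, 2]
print(mask.sum())               # 2, since True counts as 1
```

Because True counts as 1, summing a mask is a quick way to count matching rows without building the filtered data frame.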

## Combining Multiple Conditions

We can apply more than one condition at the same time. For example, to extract responses from full-time employees in Bulgaria, we can use the following code:

In [None]:
survey[ (survey['Country'] == 'Bulgaria') & (survey['Employment'] == 'Employed full-time')]
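
Two details of the syntax above are worth spelling out. Conditions are combined with the element-wise operators `&` (and), `|` (or) and `~` (not) rather than Python's `and`/`or`/`not`, and each condition needs its own parentheses because these operators bind more tightly than `==`. The sketch below uses a hypothetical toy data frame; the `isin` method is shown as a shorter alternative to chaining several `|` conditions on the same column.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Country": ["Bulgaria", "Canada", "Bulgaria", "Japan"],
    "Employment": [
        "Employed full-time",
        "Employed full-time",
        "Employed part-time",
        "Employed part-time",
    ],
})

is_bulgaria = toy["Country"] == "Bulgaria"
is_full_time = toy["Employment"] == "Employed full-time"

both = toy[is_bulgaria & is_full_time]       # and
either = toy[is_bulgaria | is_full_time]     # or
not_bulgaria = toy[~is_bulgaria]             # not
listed = toy[toy["Country"].isin(["Canada", "Japan"])]

print(both.index.tolist())          # [0]
print(either.index.tolist())        # [0, 1, 2]
print(not_bulgaria.index.tolist())  # [1, 3]
print(listed.index.tolist())        # [1, 3]
```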

## Selecting Columns While Filtering

While filtering rows according to certain criteria, we can also select certain columns by using the `loc` indexer. Let's modify the code above to only return some columns:

In [None]:
survey.loc[ (survey['Country'] == 'Bulgaria') & (survey['Employment'] == 'Employed full-time'), ["Country", "Employment"] ]

This use of `loc` is not much different from the examples above. The first argument is always the row selector, only this time we are providing a criterion for what to retrieve instead of a range or specific index labels.

# Operating on Data

The rest of this notebook is dedicated to exploring different functionality available in pandas while at the same time trying to extract something meaningful from the survey data.



## Finding Unique Values

Let's start by inspecting the different possible values of some categorical fields to give us some insight into what we can do.

In [None]:
from IPython.display import HTML, display
import tabulate

def print_series(series):
    """
    A helper function for displaying a series using HTML.
    """
    series_as_table = map(lambda x: [x], series)
    display(HTML(tabulate.tabulate(series_as_table, tablefmt='html')))

In [None]:
print_series(survey['MainBranch'].unique())

In [None]:
print_series(survey['OpenSourcer'].unique())

In [None]:
print_series(survey['Employment'].unique())

## View vs Copy

When we created the `survey` data frame, we executed the following code:

```python
survey = complete_survey[ [
    # columns
] ]
```

This gave us another data frame containing only the columns we selected. One important note is that pandas does not always make it obvious whether the result of a selection is an independent copy of the data or a [view](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) onto the original data frame. A view is similar to the concept of a reference or pointer in programming languages: it shares the underlying data instead of duplicating it.

Care must be taken when trying to operate on a view. For example, let's try to add another column to the `survey` data frame. The `ConvertedComp` column contains the salary per year. What if we want to find the monthly compensation, as is commonly the case in many countries around the world? Let's try to add this column to the `survey` data frame.

In [None]:
survey['MonthlyConvertedComp'] = survey['ConvertedComp']/12

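As you can see, this call does not go through cleanly: pandas typically reports a `SettingWithCopyWarning`, because we are adding a column to a data frame that was derived from another one and it cannot tell whether the change is meant to affect the original. The following section explains how to solve this problem.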

## Adding a Column to a Data Frame

Adding a column to a data frame is as easy as executing a call like the following:

```
survey[<new column>] = survey[<column>]
```

The new column will contain the same values as the given column. Obviously, this is dull and only results in duplicated data, so it is worth mentioning that you can apply an operation to the column before assigning it to the new column, just like a [ufunc](https://docs.scipy.org/doc/numpy/reference/ufuncs.html) in NumPy. For example:

In [None]:
survey_ext = survey.copy()
survey_ext['MonthlyConvertedComp'] = survey_ext['ConvertedComp']/12
survey_ext[ ['ConvertedComp', 'MonthlyConvertedComp'] ]

Notice that we made a copy of `survey` first: since `survey` was derived from `complete_survey`, adding a column to it directly runs into the problem described in the previous section.
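
As a minimal sketch of the same pattern on hypothetical toy data (the names `original` and `subset` are made up for this example), an explicit `copy()` gives us a data frame we can extend freely while the source stays untouched:

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
original = pd.DataFrame({
    "Country": ["Canada", "Japan"],
    "ConvertedComp": [120000.0, 60000.0],
})

# Take an explicit copy of the selection before modifying it.
subset = original[["ConvertedComp"]].copy()
subset["MonthlyConvertedComp"] = subset["ConvertedComp"] / 12

print(subset["MonthlyConvertedComp"].tolist())  # [10000.0, 5000.0]
print(original.columns.tolist())                # ['Country', 'ConvertedComp']
```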

## Finding Counts of Different Values

How do we find how often each value occurs? For example, how many respondents come from each country? Or how many respondents are hobbyists? We can use the `value_counts()` method for this:

In [None]:
survey['Country'].value_counts()

Or let's find how many are hobbyists. Notice that we used `normalize=True` in the code below to show proportions (fractions between 0 and 1) instead of raw counts.

In [None]:
survey['Hobbyist'].value_counts(normalize=True) 

Let's find a breakdown of the education levels of the respondents.

In [None]:
survey['EdLevel'].value_counts()
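
Two behaviours of `value_counts()` are easy to overlook: missing values are left out by default, and `normalize=True` returns fractions rather than percentages. The sketch below, on a hypothetical five-answer series, shows both and converts the fractions to percentages.

```python
import pandas as pd

# Hypothetical answers, including one missing value.
hobbyist = pd.Series(["Yes", "Yes", "No", "Yes", None], name="Hobbyist")

counts = hobbyist.value_counts()                   # missing value is ignored
counts_all = hobbyist.value_counts(dropna=False)   # missing value is counted
shares = hobbyist.value_counts(normalize=True)     # fractions of non-missing answers
percentages = (shares * 100).round(1)

print(counts.sum())         # 4
print(counts_all.sum())     # 5
print(shares["Yes"])        # 0.75
print(percentages["No"])    # 25.0
```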

# Grouping & Aggregation

It is common to group data by a certain field. For example, we might want to split the survey by the country of residence (grouping). Then, we could generate the same statistics for each country (aggregation). Pandas makes this possible via the `groupby` method, which returns a special grouped object. For example, calling the `groupby` method on a DataFrame returns an object of type [DataFrameGroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html):


In [None]:
survey_by_country = survey.groupby('Country')
type(survey_by_country)

You can think of this object as multiple data frames combined in one object, where each data frame is distinguished by the value of the grouping column. For example, in our code above, the result is a `DataFrameGroupBy` object whose keys are the different countries and whose values are one DataFrame per country. To understand this better, let's inspect the [indices](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.indices.html#pandas.core.groupby.GroupBy.indices) field of the grouping. This field returns a dictionary whose keys are the different values of the column we grouped by (let's call it the group-by column) and whose values are the positions of the rows in the original data frame whose group-by column matches the key. Our data frame uses the default 0-based index, so positions and index labels coincide here.

Since printing the complete dictionary will be huge, let's print the keys only and then the value of one of the keys.

In [None]:
survey_by_country.indices.keys()

In [None]:
survey_by_country.indices['Afghanistan']

We see that for `Afghanistan` we have the following indices: 719, 6391, ... etc. Let's display the `Country` column of the row with index `719` to verify that it is indeed `Afghanistan`.

In [None]:
survey.loc[719, 'Country']
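
The check above works because the survey uses the default 0-based index. On a data frame with a different index, `indices` still reports row positions, while the related `groups` field reports index labels. A minimal sketch on hypothetical toy data with a custom index:

```python
import pandas as pd

# Hypothetical toy data whose index does not start at 0.
toy = pd.DataFrame(
    {
        "Country": ["Canada", "Japan", "Canada", "Italy"],
        "ConvertedComp": [90000, 50000, 70000, 40000],
    },
    index=[10, 11, 12, 13],
)

grouped = toy.groupby("Country")

positions = grouped.indices["Canada"]   # row positions, usable with iloc
labels = grouped.groups["Canada"]       # index labels, usable with loc
canada = grouped.get_group("Canada")    # the sub-frame for one group

print(positions.tolist())                  # [0, 2]
print(list(labels))                        # [10, 12]
print(canada["ConvertedComp"].tolist())    # [90000, 70000]
```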

## Mean/Standard Deviation/Count/Etc



Let's find the average salary by country:

In [None]:
survey.groupby('Country')['ConvertedComp'].mean()

Let's also find the standard deviation for each country:

In [None]:
survey.groupby('Country')['ConvertedComp'].std()

It is easier to read if we have both aggregations in one table. For this, we can use the [agg](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function:

In [None]:
survey.groupby('Country')['ConvertedComp'].agg(["mean", "std"])
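
If the default column names `mean` and `std` are not descriptive enough, `agg` also accepts keyword arguments, where each keyword becomes a column name. The sketch below uses hypothetical toy data, and the output column names (`mean_salary` and so on) are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Japan", "Japan", "Japan"],
    "ConvertedComp": [80000.0, 100000.0, 40000.0, 50000.0, 60000.0],
})

# Each keyword names an output column; each value names the aggregation.
summary = toy.groupby("Country")["ConvertedComp"].agg(
    mean_salary="mean",
    median_salary="median",
    respondents="count",
)

print(summary)
```

Including a count next to the mean is a useful habit: a mean computed from a handful of respondents deserves less trust than one computed from thousands.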

## Grouping By Ranges of Values

In the previous section, we saw how to group by specific values, e.g. a specific country. What if we want to group by a range of values, for example ages 0-10, 11-20, and so on? For this, we can use the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function. For example, let's find the number of respondents in each age group.

In [None]:
survey.groupby(pd.cut(survey['Age'], np.arange(0, 101, 10)))['Age'].count()
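
By default, `pd.cut` builds intervals that are open on the left and closed on the right, so with the bin edges 0, 10, 20, ... an age of exactly 10 lands in the first bin and an age of 11 in the second. The sketch below checks this on a few hypothetical ages and also shows the `labels` argument for more readable group names (the label strings are made up for this example).

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([10, 11, 20, 21, 35])
bins = [0, 10, 20, 30, 40]

intervals = pd.cut(ages, bins)  # (0, 10], (10, 20], (20, 30], (30, 40]
labeled = pd.cut(ages, bins, labels=["1-10", "11-20", "21-30", "31-40"])

print(intervals.astype(str).tolist())
print(labeled.tolist())  # ['1-10', '11-20', '11-20', '21-30', '31-40']
print(labeled.value_counts()["11-20"])  # 2
```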

# Handling Missing Data

You might have noticed in many of the results above that we occasionally see the value `NaN`. This happens when there is no value in a certain cell, or when a certain operation cannot be carried out, such as calculating the standard deviation from only one sample. This is usually undesirable in machine learning as we don't know what to do with those values. There are multiple ways to handle missing data in pandas.

## Remove Rows With Missing Data

The simplest approach is to remove any row containing missing data, which we can do using the [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method.

In [None]:
survey['ConvertedComp'].dropna()

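As simple as that! As you can see, none of the retrieved values is empty now. However, we are only seeing the salary. What if we want to see the other columns as well? We can call `dropna` on the data frame itself and use `subset` to name the columns that must not be empty: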

In [None]:
survey.dropna(subset=['ConvertedComp'])
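
The effect of `subset` is easiest to see next to the default behaviour. In the sketch below, on hypothetical toy data, plain `dropna()` removes every row with any missing value, `subset` restricts the check to the named columns, and `how="all"` only removes rows in which every value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing country and one missing salary.
toy = pd.DataFrame({
    "Country": ["Canada", None, "Japan"],
    "ConvertedComp": [80000.0, 50000.0, np.nan],
})

any_missing = toy.dropna()                           # drops rows 1 and 2
salary_present = toy.dropna(subset=["ConvertedComp"])  # drops row 2 only
all_missing = toy.dropna(how="all")                  # drops nothing here

print(any_missing.index.tolist())     # [0]
print(salary_present.index.tolist())  # [0, 1]
print(all_missing.index.tolist())     # [0, 1, 2]
```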

## Filling Empty Values

The other option for handling empty values is to fill them with some value. For example, for those respondents who haven't provided their salary, one option might be to assume they earn the average salary. (Obviously, this is a wrong assumption, but it is used here for the sake of explanation.) We can achieve this by using the [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) function:

In [None]:
# Let's first find rows containing no salary information
survey_index_no_salary = survey['ConvertedComp'].isnull()
survey[survey_index_no_salary]

In [None]:
salary_mean = survey['ConvertedComp'].mean()
survey.fillna({'ConvertedComp': salary_mean})[survey_index_no_salary] # Display the rows which didn't contain salary.
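
A slightly less crude variant, offered here only as a sketch and not as a recommendation, is to fill each missing salary with the mean of the respondent's own country instead of the global mean. This combines `fillna` with `groupby(...).transform('mean')`, which returns a series of group means aligned with the original rows. The toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Japan", "Japan"],
    "ConvertedComp": [80000.0, np.nan, 40000.0, np.nan],
})

# Option 1: fill with the overall mean (what the cell above does).
overall = toy["ConvertedComp"].fillna(toy["ConvertedComp"].mean())

# Option 2: fill with the mean of each respondent's country.
country_mean = toy.groupby("Country")["ConvertedComp"].transform("mean")
per_country = toy["ConvertedComp"].fillna(country_mean)

print(overall.tolist())      # [80000.0, 60000.0, 40000.0, 60000.0]
print(per_country.tolist())  # [80000.0, 80000.0, 40000.0, 40000.0]
```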

# Plotting

The main method for plotting in pandas is the `plot` method of Series and DataFrame. For a Series, the index is the x-axis and the values of the series are plotted on the y-axis. A DataFrame is plotted the same way, except that each column gets its own line; essentially, think of plotting a data frame as plotting multiple series sharing the same index.




## 2D Plot

To demonstrate 2D-plots, let's plot the average salary per age.


In [None]:
salary_by_age = survey[ ['Age', 'ConvertedComp'] ].dropna().groupby('Age')
salary_by_age.mean()
salary_by_age.mean().dropna().plot()

The graph is distorted by a few points which seem (at least to me) to be unrealistic; it is hard to justify the spike in the mean salary near the age of 30. Let's use the [quantile](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html) method to keep only the entries between the 5th and 95th percentiles, leaving out everything below or above that range.

In [None]:
p5, p95 = survey['ConvertedComp'].quantile(0.05), survey['ConvertedComp'].quantile(0.95)
salary_by_age = survey[ (survey['ConvertedComp'] >= p5) & (survey['ConvertedComp'] <= p95) ] [ ['Age', 'ConvertedComp'] ].dropna().groupby('Age')
salary_by_age.mean().dropna().plot()
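
The trimming step can be checked in isolation on a small made-up series. The sketch below is an equivalent formulation, not the code used above: it asks `quantile` for both percentiles in one call and uses the `between` method, which includes both bounds by default, in place of the two comparisons.

```python
import pandas as pd

# Hypothetical compensation values with one extreme value at each end.
comp = pd.Series([1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 1000], name="ConvertedComp")

# Passing a list returns both quantiles at once.
p5, p95 = comp.quantile([0.05, 0.95])

# Keep only the values inside the [p5, p95] range.
trimmed = comp[comp.between(p5, p95)]

print(len(comp), len(trimmed))       # 11 9
print(trimmed.min(), trimmed.max())  # 10 90
```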

The graph is better now: we no longer see a hard-to-explain spike of 1 million in the average, but it is still a bit strange. I wonder whether this is caused by the different pay scales among different countries. Let's repeat the same operation, but limit the result to one country, e.g. the United Kingdom.

In [None]:
survey_uk = survey[survey['Country'] == 'United Kingdom']
p5, p95 = survey_uk['ConvertedComp'].quantile(0.05), survey_uk['ConvertedComp'].quantile(0.95)
salary_by_age = survey_uk[ (survey_uk['ConvertedComp'] >= p5) & (survey_uk['ConvertedComp'] <= p95) ] [ ['Age', 'ConvertedComp'] ].dropna().groupby('Age')
salary_by_age.mean().dropna().plot()

The data seems more reasonable now, though still spiky rather than smooth, which I cannot explain! I leave it to the reader to decide whether to play with this data more to understand what is going on!

## Bar Chart

Plotting a bar chart is no different from a normal 2D plot; we simply use the `plot` method but pass `kind='bar'` as an argument. For example, let's inspect the `OpenSourcer` column via a bar chart.


In [None]:
survey.groupby('OpenSourcer')['OpenSourcer'].count().plot(kind='bar')

## Pie Chart

A pie chart is very useful when we have a limited number of categories. Let's convert the bar chart above into a pie chart.

As expected, the vast majority of respondents are not actively contributing to open source.

In [None]:
survey.groupby('OpenSourcer')['OpenSourcer'].count().plot(kind='pie')

## Multiple Graphs

Calling the `plot` method on a DataFrame draws multiple graphs in the same figure, one per column. Let's expand the 2D graph above to multiple countries or, to be more accurate, break it down by country.

In [None]:
series = []
countries = ['United States', 'Japan', 'United Kingdom', 'Canada', 'Germany', 'Italy', 'Russia']
# For each of the countries above, generate an aggregation for the mean of compensation by age.
for country in countries:
    survey_by_country = survey[survey['Country'] == country]
    salary_by_age = survey_by_country[ ['ConvertedComp'] ].groupby(pd.cut(survey_by_country['Age'], np.arange(0, 101, 5)))
    series.append(salary_by_age.mean())
# Concatenate the results into a data frame and plot.
c = pd.concat(series, axis=1)
c.columns = countries
c.plot()
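
The loop above can also be written without an explicit loop, by grouping on the age bin and the country at the same time and then moving the country level into the columns with `unstack`. This is an alternative sketch on hypothetical toy data, not the approach used in the cell above; the plotting call is left as a comment so the snippet runs without matplotlib.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Japan", "Japan", "Canada", "Japan"],
    "Age": [23, 24, 23, 33, 34, 31],
    "ConvertedComp": [50000.0, 70000.0, 30000.0, 60000.0, 90000.0, 40000.0],
})

# Two age bins: (20, 30] and (30, 40].
age_bin = pd.cut(toy["Age"], np.arange(20, 41, 10))

# Mean compensation per (age bin, country), with one column per country.
mean_comp = (
    toy.groupby([age_bin, "Country"], observed=True)["ConvertedComp"]
    .mean()
    .unstack("Country")
)

print(mean_comp)
# mean_comp.plot() would then draw one line per country.
```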


## Other Plot Types

There are multiple other plot types that can be generated using pandas. The basic idea is the same, so I will leave the reader with the [visualization page](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) of pandas documentation to learn about and experiment with different plot types.

# Summary

Pandas is a huge library and it is hard to cover all its functionality in one notebook. However, I tried to cover several different aspects of it to bring the reader up to speed with using this amazing library. More information can be found in the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html) of pandas.
