# Introduction to Pandas

**Pandas** is a Python library created by **Wes McKinney**, who started developing it in 2008 on top of the **NumPy** library; it was open sourced by the end of 2009. It is widely used in data science, machine learning and data analysis tasks. The name is derived from the term **"PANel DAta"** (tabular data), an econometrics term for data sets that include observations over multiple time periods for the same individuals. ([wikipedia](https://en.wikipedia.org/wiki/Pandas_(software)))

**Pandas** works well together with other libraries such as **Matplotlib / Seaborn** and, of course, **NumPy**.

<img src="files/pandas_numpy_matplotlib_seaborn.png" width="100%" align="center">

In [None]:
import pandas as pd # We'll use pd as the alias
import numpy as np # and np as alias for numpy

# DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, and highly flexible data structure for data manipulation and analysis in Python. The DataFrame is often compared to a spreadsheet or a SQL table, as it organizes data into rows and columns, making it easy to work with structured data.

Key characteristics of a Pandas DataFrame:

- **Two-Dimensional**: A DataFrame consists of rows and columns, much like a table in a relational database or an Excel spreadsheet.

- **Size-Mutable**: You can add or remove rows and columns from a DataFrame, making it adaptable to changing data.

- **Labeled Axes**: Both rows and columns have labels (index and column names), allowing for easy identification and indexing of data.

- **Mixed Data Types**: A DataFrame can contain data of different types (e.g., integers, floats, strings) in different columns, or inside the same column.

- **Missing Data Handling**: DataFrames can handle missing or NaN (Not-a-Number) values gracefully, providing tools for detecting, removing, or imputing missing data.

- **Operations**: DataFrames support a wide range of data operations, including filtering, grouping, aggregation, pivoting, merging, and more.

## Creating a DataFrame

Pandas can create DataFrames from many different sources and file formats, including:

- CSV (comma-separated values)
- Excel (XLS and XLSX)
- JSON (JavaScript Object Notation)
- HTML (Tables)
- SQL (Databases)
- Parquet
- HDF5 (Hierarchical Data Format)
- Feather
- Stata
- SAS
- Google BigQuery
- Clipboard
- Python dictionaries
- URLs (HTTP, FTP, etc.)
- ... And many more
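
As a minimal sketch of two of these sources, the cell below builds the same small DataFrame from a Python dictionary and from CSV text. The data is made up for the illustration, and `io.StringIO` is only used here to simulate a file without needing one on disk.

```python
import io
import pandas as pd

# Hypothetical data, made up for this illustration
data = {"letter": ["A", "B", "C"], "value": [1.5, 2.0, 3.25]}
from_dict_df = pd.DataFrame(data)

# pd.read_csv() accepts a path, a URL or any file-like object
csv_text = "letter,value\nA,1.5\nB,2.0\nC,3.25\n"
from_csv_df = pd.read_csv(io.StringIO(csv_text))

print(from_dict_df.shape, from_csv_df.shape)
print(list(from_csv_df.columns))
```

Both objects have the same columns and the same number of rows, whatever the source was.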

## Fake dataset

Let's create a DataFrame from a CSV file stored inside the "data" folder and named "fake.csv". We'll use this fake dataset later to demonstrate some of the Pandas functions. Let's store it in a variable named : **"fake_df"**.

**Note**: Here we're using a naming convention named **"suffix Hungarian notation"**, meaning the type of the object is included at the end of its name. And, of course, "df" is short for "DataFrame".

In [None]:
fake_df = pd.read_csv("data/fake.csv")

fake_df

If you run the cell above, you can tell right away that "fake_df" is a **DataFrame**: column names and indices are in bold. And if you mouse over the DataFrame, rows are highlighted.

## Reading a CSV file : the "Survey Data"

**>>>** Use the ``pd.read_csv()`` function to read the CSV file named "survey.csv" which is located inside the "data" folder. Store the result in a new DataFrame named "df".

If you try to read a CSV file and Pandas raises an error, open the file with JupyterLab or a text editor (VS Code, Notepad++, etc.) and examine it to find the source of the error. The most common errors when reading a CSV file are:

- The **filepath** was not properly given to the function. The easiest way is to move the file you want to read to the same directory as your notebook file (or to a subfolder named "data").

- A wrong **field separator**. By default Pandas assumes that it is the "," character. If your file uses another one, specify the separator (= delimiter) with the argument "sep".

- A bad **"quotechar"**, a character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored. In this case specify it with the "quotechar" argument.

- Bad file **encoding**. The "utf-8" standard is the most common, but sometimes the files are in other formats like "cp1252" for example. In this case specify the encoding with the "encoding" argument.

- The presence of **extra lines** at the beginning or at the end of the file. In this case use the "skiprows" or "skipfooter" arguments to ignore these lines.
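
Here is a small sketch of how some of these arguments can be combined. The file content below is invented for the illustration and is read from memory with `io.StringIO`; with a real file you would pass its path instead, and add `encoding="cp1252"` (for example) in the same way if the encoding were the problem.

```python
import io
import pandas as pd

# Hypothetical file content: one junk line at the top, ";" as separator,
# and a quoted field that contains the separator itself
raw_text = (
    "exported by some tool\n"
    "name;comment\n"
    "Alice;'likes cats; dislikes rain'\n"
    "Bob;'no comment'\n"
)

example_df = pd.read_csv(
    io.StringIO(raw_text),
    sep=";",        # field separator
    quotechar="'",  # character that wraps quoted items
    skiprows=1,     # ignore the first line of the file
)
print(example_df)
```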

**NOTE**: Do **NOT** open the file with the software "Excel": it may corrupt your file and make it unreadable, even if you don't save the modifications.

In [None]:
# Code here !


# First things to do

## The `.shape` property

We now have two DataFrames "fake_df" and "df". Let's take a look at how they're shaped.

In [None]:
fake_df.shape

**>>>** What shape is our df? What does it mean?

In [None]:
# Code here!


## The ``.head()`` function

It returns the first n rows, default is set to 5.

In [None]:
fake_df.head()

**>>>** Use the ``.head()`` function to display the first 2 lines of our df.

In [None]:
# Code here!


## The `.columns` property

It stores the names of our different columns. It is also the index of the columns.

In [None]:
fake_df.columns

**>>>** What are the columns of our DataFrame? Use a ``for`` loop to print each column name on a different line.

In [None]:
# Code here!


## The `.index` property

It stores the names of the rows (the index).

**>>>** What are the rows of our df? Use a ``for`` loop to print each row index on a different line.

In [None]:
# Code here!


## The `.dtypes` property

The word `dtypes` stands for "data types": it stores the types of our different columns. The type "object" usually means the column contains strings.

In [None]:
fake_df.dtypes

**>>>** What are the dtypes of our df?

In [None]:
# Code here!


## Missing values : the `.isna()` method

When you're given a new dataset, it is important to check whether there are missing values. You can use the ``.isna()`` method: it returns a new DataFrame which has the same shape as the original df, but the values are ``True`` where a value is missing and ``False`` where a value exists.

That's one of the main strengths of Pandas: the outputs of many Pandas functions are also Pandas objects, meaning you can work on your data or your results using the same functions.

In [None]:
fake_df.isna()

**>>>** Use ``.isna()`` on your DataFrame.

In [None]:
# Code here


## Apply a function to a DataFrame : ``isna().sum()``

The ``.sum()`` method sums the values of a DataFrame column by column and returns one total per column. When performing sums, boolean values are treated as 1 if they're ``True`` and 0 if they're ``False``.
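
To make this concrete, here is a small sketch on a made-up DataFrame (the names `toy_df` and `missing_per_column` are only used for this illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values
toy_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

missing_per_column = toy_df.isna().sum()  # one count per column
print(missing_per_column)

# True counts as 1 and False as 0
print(pd.Series([True, False, True]).sum())
```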

In [None]:
fake_df.isna().sum()

**>>>** How many missing values are there in our DataFrame?

In [None]:
# Code here!


# Series

The output generated by ``.isna().sum()`` is called a Series. A Series is a one-dimensional labeled array-like data structure; it is sometimes referred to as a "column". A Series consists of two main components:

- **Values**: This is the actual values contained in the Series, which can be of any data type such as integers, floats, strings, or more complex types.

- **Index**: An index is a label or identifier associated with each data point in the Series.

DataFrames are a collection of Series, so if you slice your DataFrame using the name of the columns, you'll also get Series.

In [None]:
fake_df['letter']

You can tell it's a Series because there's only one dimension, and, unlike a DataFrame, the output is plain text.
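
If you prefer to check this in code rather than by eye, the sketch below (on a made-up DataFrame) compares selecting with a single column name and with a list of column names:

```python
import pandas as pd

toy_df = pd.DataFrame({"letter": ["A", "B"], "value": [10, 20]})

one_column = toy_df["letter"]    # single name: a Series
sub_frame = toy_df[["letter"]]   # list of names: a DataFrame with one column

print(type(one_column).__name__, one_column.shape)
print(type(sub_frame).__name__, sub_frame.shape)
```

The Series has a one-element shape, while the DataFrame keeps two dimensions even with a single column.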

**>>>** Slice the column "country" from your DataFrame and display the corresponding Series.

In [None]:
# Code here!


## Creating or replacing Series

Just like with a dictionary, to create or replace a Series you just have to assign a value to it (a list, a dict, an integer, almost any object...).

```python
df['my_new_series'] = my_value(s)
```

In [None]:
fake_df['one'] = 1
fake_df

In [None]:
fake_df['one'] = 999
fake_df

In [None]:
fake_df[['one', 'two']] = 1
fake_df

In [None]:
fake_df[['one', 'two']] = 1, 2 # An implicit tuple
fake_df

In [None]:
fake_df['three'] = fake_df['one'] + fake_df['two']
fake_df

In [None]:
fake_df['count'] = [el for el in range(fake_df.shape[0])]
fake_df

### Dropping a Series

There are several ways to "drop" (erase / remove / delete) a Series from your DataFrame, one of the easiest is just:

In [None]:
fake_df.drop(columns='one') # This function is not "in place" which means we haven't modified "fake_df" yet.

In [None]:
# If we're happy with the result
# We can replace the old df with the new one
fake_df = fake_df.drop(columns='one')

**/!\ WATCH OUT !** This time we're not replacing or creating a **Series** we're replacing the whole DataFrame!

One can get confused very easily. Luckily if you make a mistake, it's also very easy to go back and re-run the cells.

In [None]:
# We can drop several columns by passing a list.
fake_df.drop(columns=['two', 'three', 'count'])

In [None]:
# If we're happy with the result
# We can replace the old df with the new one
fake_df = fake_df.drop(columns=['two', 'three', 'count'])

# Dealing with data

## Setting the right data types

Now that we know what a DataFrame and a Series are, and before we do anything else, it's important to convert our Series to the right types.

In [None]:
fake_df.dtypes

### Conversion with `.astype()`

There are many different types of dtype. Some of them use standard Python format, some are specific to Pandas and others are common to several other languages (PyArrow).

Here let's use either:

- `'string'` (which is a Pandas type)
- `'category'` (also pandas type)
- `int` (python type)
- `float` (python type)

In [None]:
# One conversion
fake_df['letter'].astype('category')

In [None]:
# Several conversions
fake_df[['fruit', 'letter']].astype('category')

In [None]:
# replacing old Series with new ones
fake_df[['fruit', 'letter']] = fake_df[['fruit', 'letter']].astype('category')

## Conversion with a new import

The best practice is probably to set the data types when importing the data. This is especially true when we work with large datasets, because dtypes such as `category` can use much less memory than strings when a column contains few distinct values.
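
As a rough illustration of the memory argument, the sketch below compares the same made-up column stored as plain Python strings and as `category`. The exact numbers depend on your Pandas version, so none are shown here.

```python
import pandas as pd

# Hypothetical column: a few distinct labels repeated many times
fruits = pd.Series(["apple", "banana", "cherry"] * 10_000)

as_object = fruits.astype(object)
as_category = fruits.astype("category")

# deep=True also counts the memory used by the Python strings themselves
print(as_object.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```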

When using the function `pd.read_csv()`, we can pass to the argument `dtype` a dictionary with column names as keys and dtypes as values. We can either type this dictionary by hand or use a dict comprehension to generate a template and then edit it.

In [None]:
{col : 'string' for col in fake_df.columns}

In [None]:
# copy / paste and edit:
d = {'letter': 'category',
     'fruit': 'category',
     'value': 'float32',
     'numbers_list': 'string',
     'date': 'string'}

In [None]:
fake_df = pd.read_csv("data/fake.csv", dtype=d)
fake_df.dtypes

**Note**: If a column is not in the dictionary provided to `pd.read_csv()`, Pandas will try to infer the dtype.
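
A minimal sketch of this behaviour, using made-up CSV text read from memory (the column "count" is deliberately left out of the dictionary):

```python
import io
import pandas as pd

csv_text = "letter,value,count\nA,1.5,10\nB,2.5,20\n"

# Only two of the three columns are listed: "count" is left to inference
partial_dtypes = {"letter": "category", "value": "float32"}
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=partial_dtypes)
print(typed_df.dtypes)
```

The two listed columns get the requested dtypes, and "count" is inferred as an integer column.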

**>>>** Set the right dtypes for the survey DataFrame (df).

- "country", "population", "size", "pets", "sport", "colour", "show", "time" will be 'string'.
- "pop_clean", "educ_years", "salary", "siblings" can all be 'int32' to save up a little bit of memory.
- "age_group", "father_degree", "mother_degree", "gender" will be 'category'.

In [None]:
# Code here!

In [None]:
# Display the DataFrame to check the result

df

### Time format

Our "df" and "fake_df" both contain dates. However, if you take a look at them, they're only strings, not dates yet. Pandas can use "datetime" objects, allowing us to plot and perform operations on them.

To do so, we can use the function ``pd.to_datetime()``, which will convert our strings to the right format. Sometimes Pandas is able to infer the date format automatically. But in this case the strings are not in a standard format, so we need to pass a *strftime* (string format time) pattern to the parameter "format" to tell Pandas what the date format is.

Each "%" followed by a letter is a code that Pandas replaces with the matching element it finds in the string (hour, day, year...). The other characters are literal text that must be present in the string and are discarded when converting to the datetime format.
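
Here is a small self-contained sketch of this idea. The two strings are invented for the illustration, and the month abbreviations are assumed to be in English.

```python
import pandas as pd

# Hypothetical strings written in the same style as the format below
raw_dates = pd.Series(["14h:30m:05s 05-Mar-2021", "08h:00m:00s 17-Nov-2022"])

# %H, %M, %S, %d, %b and %Y are codes; "h:", "m:", "s " and "-" are literal text
parsed = pd.to_datetime(raw_dates, format="%Hh:%Mm:%Ss %d-%b-%Y")
print(parsed)
print(parsed.dt.year.tolist())
```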

In [None]:
pd.to_datetime(fake_df['date'], format='%Hh:%Mm:%Ss %d-%b-%Y')

In [None]:
# Once we're happy with the result, let's create a new Series
# that will contain the date in the right format
fake_df['datetime'] = pd.to_datetime(fake_df['date'], format='%Hh:%Mm:%Ss %d-%b-%Y')

In [None]:
# We can now perform operations on this Series
fake_df['datetime'].mean()

## The Series "time"

**>>>** Create a new Series in our df named "datetime" with the function `pd.to_datetime()` and the right *strftime* pattern, so that our dates are stored in datetime format.

To better understand how a *strftime* is used, move the edit cursor inside the function's name, then press "shift+tab". It will open the "docstring", which contains a lot of useful information. There's a link inside the "format" section: follow it and it will take you to the Pandas documentation (go to the bottom of the page).

Then find the minimum value, the maximum value and the mean of the "datetime" Series we've just created.

In [None]:
print(df['time'][0]) # Checking what the first line looks like
# Code here!


In [None]:
# When you're done, replace the old Series with the new one here
# Code here!


In [None]:
# Display the min:
# Code here!


In [None]:
# Display the max:
# Code here!


In [None]:
# Display the mean:
# Code here!

**>>>** Drop the Series "time" from your df.

**/!\ WATCH OUT !** This time we're not replacing or creating a **Series** we're replacing the whole DataFrame !

- You can get confused very easily. Luckily if you make a mistake, it's also very easy to go back and re-run the cells.
- If you have several columns to drop, you can pass a list of columns to the method `.drop()`.

In [None]:
# Code here!


## Plotting data

Let's go back to our fake_df. You can plot a Series like this:

In [None]:
fake_df['value'].plot(); # The trailing semicolon hides the text output of the cell
# Note that the line stops as soon as it encounters a missing value (NaN)

In [None]:
# You can specify what graph you want with the 'kind' parameter.
# Let's use a bar graph first.
# Bar graphs are usually used to plot categorical data.
# As we want to plot the value for each row of our fake_df, this would work.

fake_df['value'].plot(kind='bar');

But the xticks are the index of our DataFrame. Let's plot some values as "y" and some categorical values as "x". In order to do this, you can use ``.plot()`` directly on a DataFrame, allowing you to manipulate multiple Series easily.

In [None]:
fake_df.plot(x='letter', y='value', kind='bar');

## The Series "pop_clean"

It contains the number of inhabitants in several countries.

**>>>** Plot both the Series 'pop_clean' and 'country' on the same graph.

In [None]:
# Code here!


### Operations on numeric data

The Series **'value'** in *fake_df* and **'pop_clean'** in *df* both contain numerical data, meaning you can apply many different statistical functions to them:

In [None]:
fake_df['value'].mean()

In [None]:
fake_df['value'].median()

In [None]:
fake_df['value'].describe()

**>>>** Have a look at basic statistics on the "pop_clean" data. You can use the method `.astype(int)` to convert the result to integers.

In [None]:
# Code here!


### Displaying large float numbers

When working with large float numbers, it might be useful to change the way Pandas displays them. For example:

In [None]:
# We create a function that formats a float with an underscore
# as thousands separator and keeps only 2 decimal places.
def thousands_separator(x):
    return '{:_.2f}'.format(x)

pd.set_option('display.float_format', thousands_separator)
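
To see what this format string does on its own, here is a tiny sketch in plain Python (the number is arbitrary):

```python
# "_" groups the digits by thousands, ".2f" keeps two decimal places
formatted = '{:_.2f}'.format(1234567.891)
print(formatted)  # 1_234_567.89
```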

### Plotting the data

To get a better grasp of a Series, we can plot it using the method ``.plot()``. It works fine on numerical values.

In [None]:
fake_df['value'].plot(); # The trailing semicolon hides the text output of the cell

But if we apply the same method to a categorical column, it will raise an error:

In [None]:
# Yields an error!
#df['country'].plot()

### The method ``value_counts()``

This method is very useful: it takes almost any Series as input and returns a new Series which contains the number of occurrences of each element.

In [None]:
fake_df['letter'].value_counts()

## The Series "country"

In [None]:
df['country']

**>>>** Use the ``.value_counts()`` method and plot the "country" Series.

In [None]:
# Code here!


**>>>** This works, but it's not ideal. Check the [online documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) and give better arguments to the parameters of the `.plot()` method in order to display a useful graph. You're probably going to use the parameter "kind", and maybe "rot" (rotation).

In [None]:
# code here!


## String manipulation

As you can see, some country names start with a capital letter and others are lower-case. Pandas provides many functions to work with strings. They are available through an accessor called `.str`. For example:

In [None]:
fake_df['letter'].str.lower()

In [None]:
fake_df['letter'].str.replace("D", "The function can also be used!")

## Fixing the strings in the Series "country" 

**>>>** Capitalize the strings inside the Series "country". When it's done, replace the old Series with the new one and plot the data again.

In [None]:
# Code here!


## The method `.sort_index()`

We can apply the method `.sort_index()` to a Series to sort it by its index, which is useful to better visualize results or to plot data. It uses alphabetical order for strings and numerical order for numbers.

In [None]:
# Remember this Series of random letters?
fake_df['letter']

In [None]:
fake_df['letter'].value_counts()

In [None]:
fake_df['letter'].value_counts().sort_index()

In [None]:
fake_df['letter'].value_counts().sort_index(ascending=False)

## The Series "educ_years"

**>>>** Let's take a closer look at the "educ_years" Series :
- Display basic statistics with `.describe()`.
- Plot the "educ_years" Series using a "bar" graph.
- Now plot it using a "kde" (Kernel Density Estimator) graph.
- Use ``.value_counts()`` on the Series and plot the result.
- Make sure your xticks are in ascending order, using the method ``.sort_index()`` on your Series.

In [None]:
# Code here!


In [None]:
# Code here!


In [None]:
# Code here!


In [None]:
# Code here!

In [None]:
# Code here!


## Selecting data in Pandas : ``.iloc[]`` and ``.loc[]`` methods

The `.iloc[]` and ``.loc[]`` methods in Pandas are used for indexing and selecting data in DataFrames. They serve different purposes and work based on different indexing schemes:

### The ``.iloc[]`` method (Integer Location)

``.iloc[]`` is primarily used for selecting data by integer position, which means you specify row and column positions numerically :

- It accepts integer-based indexing for both rows and columns.
- The indexing is zero-based, similar to Python lists.
- You can use integers, slices, lists, or boolean arrays to select data.

In [None]:
fake_df.iloc[0]  # Select the first row
# Note that it returns a Series, not a DataFrame.

In [None]:
fake_df.iloc[2:5, 1:3]
# Returns a DataFrame because there are several Series.

In [None]:
fake_df.iloc[[0, 3, 5], [1, 2]]  # Select specific rows and columns by integer positions

In [None]:
# Select specific rows and columns with boolean indexing
fake_df.iloc[[True, False, True, False, True, True, False, True, False, True, False], [False, True, True, False, False, True]]

### The ``.loc[]`` method (Label Location)

The method ``.iloc[]`` can sometimes be useful, but generally we use the ``.loc[]`` method which is very powerful. It allows us to select data by label or label-based conditions.

- It accepts label-based indexing for both rows and columns.
- Unlike most slicing in Python, label slices are **inclusive on both ends** (i.e., slices include the specified labels).
- You can use labels, slices, lists, or boolean arrays to select data.
- You can filter using conditions.
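
The difference between the two slicing behaviours is easier to see on a DataFrame whose index is not made of integers. Here is a minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical DataFrame with string labels as index
toy_df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_position = toy_df.iloc[0:2]   # positions 0 and 1, the end is excluded
by_label = toy_df.loc["a":"c"]   # labels "a", "b" and "c", the end is included

print(len(by_position), len(by_label))
```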

In [None]:
fake_df.loc[0:3, 'fruit']

In [None]:
fake_df.loc[1:2, ['fruit', 'datetime']]

In [None]:
# Just like .iloc[], you can filter rows and columns using boolean indexing
fake_df.loc[[True, False, True, False, True, True, False, True, False, True, False], [False, True, True, False, False, True]]

In [None]:
# Pandas returns a boolean Series when you make comparison
fake_df['value'] > 500

In [None]:
# You can use this boolean Series to filter your DataFrame using .loc[]
fake_df.loc[fake_df['value'] > 500]

## The Series "salary"

**>>>** Let's have a look at the "salary" Series.

- Display basic statistics.
- Plot it.
- Choose a threshold, remove the outliers using ``.loc[]`` and compute the same stats.
- Generate a new graph without the outliers.

**TIP**: When using a hist graph, you can decide the number of "columns" using the "bins" parameter. Default is set to 10.

In [None]:
# Code here!


In [None]:
# Code here!


In [None]:
# Code here!


In [None]:
# Code here!

## Outliers : the rule of thumb

In statistics, an outlier is an observation or data point that significantly deviates from the majority of the other data points in a dataset. Outliers can be unusual values that are either much larger or much smaller than the typical values in the dataset. They are often referred to as "extreme" values.

Sometimes, when you process a huge amount of data and you need to remove the outliers, you can use a statistician's rule of thumb: an outlier is a value that is above the distribution mean plus three times the standard deviation, or below the mean minus three times the standard deviation.
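
As a sketch of this rule on made-up numbers (30 ordinary values and one extreme value, all invented for the illustration):

```python
import pandas as pd

# Hypothetical measurements: 30 ordinary values and one extreme value
values = pd.Series([10, 12, 9, 11, 10] * 6 + [1000])

mean, std = values.mean(), values.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Keep only the values inside the bounds
kept = values.loc[(values >= lower) & (values <= upper)]
print(len(values), len(kept))
```

Only the extreme value falls outside the bounds, so one row is removed.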

In [None]:
fake_df['value'] # There are 2 outliers there

In [None]:
fake_df['value'].mean()

In [None]:
fake_df['value'].std()

In [None]:
# rule of thumb for outliers : mean +/- 3 times the std
# Upper bound :
print(fake_df['value'].mean() + 3 * fake_df['value'].std())
# Lower bound :
print(fake_df['value'].mean() - 3 * fake_df['value'].std())

However this "rule of thumb" doesn't apply here as we don't have enough data. So let's just take the mean and add or subtract the standard deviation.

In [None]:
# our rule for outliers : mean +/- the std
# Upper bound :
print(fake_df['value'].mean() + fake_df['value'].std())
# Lower bound :
print(fake_df['value'].mean() - fake_df['value'].std())

## The .between() method

The ``.between()`` method allows us to create a boolean Series which returns ``True`` if the value is between the 2 arguments, and ``False`` if it's not the case.

In [None]:
fake_df['value'].between(-22, 1000)

In [None]:
# You can then filter your entire DataFrame based on that condition
fake_df.loc[fake_df['value'].between(-22, 1000)]

In [None]:
# Let's apply our "outliers policy" on our dataframe:

fake_df.loc[fake_df['value'].between(
                            fake_df['value'].mean() - fake_df['value'].std(), # lower bound
                            fake_df['value'].mean() + fake_df['value'].std() # upper bound
                                     )]

Note that the rows with missing values in the column "value" have also disappeared: a comparison with NaN is always ``False``.
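
The sketch below shows both behaviours on a tiny made-up Series: the bounds themselves are included by default, and a missing value never passes the test.

```python
import numpy as np
import pandas as pd

toy_values = pd.Series([5.0, 50.0, np.nan, 500.0])

mask = toy_values.between(5, 100)  # both bounds are included by default
print(mask.tolist())

print(toy_values.loc[mask].tolist())
```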

## Creating a subset of a DataFrame

Let's store the result in a new DataFrame, which will be a **subset** of our original DataFrame. Note that filtering with a boolean condition returns new data rather than a view, just like boolean indexing on a NumPy array.

In [None]:
clean_values_fake_df = fake_df.loc[fake_df['value'].between(
                            fake_df['value'].mean() - fake_df['value'].std() # lower bound
                          , fake_df['value'].mean() + fake_df['value'].std() # upper bound
                                             )]

In [None]:
clean_values_fake_df.head()

## Creating a subset of "salary" without outliers

There are many ways to deal with outliers. One of the easiest ways is simply to remove the rows that contain extreme values. However, we don't want to modify our original DataFrame.

**>>>** Create a new DataFrame called "salary_df" which will not contain the rows where the salary values are over a threshold.

In [None]:
# Code here!


In [None]:
# When you're happy with the result,
# create in this cell a new dataframe called "salary_df"
# which doesn't have any outlier
# Code here!


# Group By and Aggregations

## Group By


In data science, a "Groupby" is an operation that involves splitting a dataset into groups based on one or more criteria. It is a way to break down data into smaller, manageable pieces for analysis.

Once data are separated in different groups, we usually apply one or several functions on each different group.

## Aggregations


"Aggregations" refer to the process of applying a mathematical or statistical function to a set of data to obtain a single summary value. Aggregations typically involve operations like sum, mean, median, count, min, max, etc.

Aggregations are used to summarize and condense data, providing insights into the overall characteristics of a dataset or specific groups created using groupby.

## Examples

Let's imagine we have a library with several books classified by genre. We can group them and apply the function `.sum()` to check how many books we have in each category.

<img src="files/group_by-sum.jpg" width="90%" align="center">

But we could also apply the function `.mean()` to compute the average.

<img src="files/group_by-avg.jpg" width="100%" align="center">

[Source](https://learnsql.com/blog/group-by-in-sql-explained/)
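
The same idea can be sketched in Pandas. The small catalogue below is invented for the illustration and does not reproduce the numbers of the pictures above.

```python
import pandas as pd

# Hypothetical library catalogue
books_df = pd.DataFrame({
    "genre": ["adventure", "fantasy", "adventure", "romance", "fantasy"],
    "copies": [11, 8, 7, 2, 4],
})

copies_per_genre = books_df.groupby("genre")["copies"].sum()
mean_per_genre = books_df.groupby("genre")["copies"].mean()
print(copies_per_genre)
print(mean_per_genre)
```

Each result is a Series with one row per genre: the rows were split by genre, then one function was applied to each group.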



## Performing GROUP BY on our datasets

### A simple Group By

We can create a groupby object without applying a function, and save it for later.

In [None]:
clean_values_fake_df.groupby('letter') # Data have been grouped by Letter

### Simple functions

#### ``.sum()``

In [None]:
# Without numeric_only=True, Pandas may raise an error,
# because this function can't be applied to every column type.
clean_values_fake_df.groupby('letter').sum(numeric_only=True)

#### ``.count()``

In [None]:
# Let's use a count()
# here we don't need the parameter numeric_only=True
# because count() can be applied on any object.
clean_values_fake_df.groupby('letter').count()

#### ``.mean()``

In [None]:
clean_values_fake_df.groupby('letter').mean(numeric_only=True)

### The method ``.agg()``


The .agg() method in Pandas is used to perform aggregation operations on a DataFrame or Series.

We can specify one or more aggregation functions that we want to apply to the data. These functions can be built-in functions like ``sum()``, ``mean()``, ``min()``, ``max()``, or custom functions.

It can take string arguments, lists or even dictionaries.

#### ``.agg()`` with one function

In [None]:
clean_values_fake_df.groupby('letter').agg('mean', numeric_only=True)

#### ``.agg()`` with several functions

Here we take only the Series "value" from the grouped data, and apply three different functions to it.

In [None]:
clean_values_fake_df.groupby('letter')['value'].agg(['count', 'sum', 'mean'])

#### ``.agg()`` with a dict of arguments

Passing a dictionary allows us to choose which aggregation functions are applied to each column.

In [None]:
clean_values_fake_df.groupby('letter').agg({
        'value' : ['mean','median', 'max', 'min'],
        'fruit':  ['count']})

**>>>** Let's find out if we can learn more about the participants of this survey. Using the previous "salary_df" we've created :

- Group by age and look at the max, the min, the count, and the average salary.
- Plot the average of the last result to better visualise.
- Group by gender and look at the max, the min, the count, and the average salary.
- Plot the average of the last result to better visualise.

In [None]:
# Code here!


In [None]:
# Code here!


In [None]:
# Code here!


In [None]:
# Code here!


## The `.map()` method

The .map() method in Pandas is used to transform values in a Series based on a mapping or a provided function. It is a versatile and flexible way to apply custom transformations to the data. Here's how the .map() method works:


- If you provide a dictionary, it maps existing values in the Series to new values.
- If you provide a function, it applies the function to each element in the Series.

**Output**: The result is a new Series with transformed values based on the mapping or function.

Let's see some examples:

### Using a custom function with `.map()`

Once we've created a function, `.map()` will apply it to each element and return a new Series.

In [None]:
# Let's define a function
def adds_1000(number):
    return number + 1000

In [None]:
# Test
adds_1000(123.53)

In [None]:
fake_df['value'].map(adds_1000)

### Using a dictionary with `.map()`

If we give a dictionary to `.map()`, it will map the different elements. If it finds the key, it will replace it with the corresponding value. Doing this is called "mapping".
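
One detail is worth knowing before using a dictionary: values that are not among its keys are not kept, they become missing values. A minimal sketch with made-up data:

```python
import pandas as pd

letters = pd.Series(["A", "B", "C"])
partial_mapping = {"A": "Z", "B": "Y"}  # "C" is deliberately left out

mapped = letters.map(partial_mapping)
print(mapped)
```

So it is a good habit to check the result with `.isna()` after a mapping.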

In [None]:
my_mapping_dict = {'A': 'Z',
                   'B': 'Y',
                   'C': 'X',
                   'D': 'W'}

fake_df['letter'].map(my_mapping_dict)

# The Series "siblings"

There is an error in the statement of the question. Can you spot it?

>How many siblings do you have, including yourself? Example: If you have 2 sisters and 1 brother, write 4.
>Include half-siblings - or your step-mother or step-father's children-  if you grew up with them.
>If you are an only child, enter 0.

We'll need to correct the answers before using it. To do so, we'll use `.map()`.

**>>>** Use `.map()` with a custom function to correct the "siblings" Series.
- If number of siblings is 0, set it to 1. Otherwise leave the current number.
- When it's done replace the old Series with the new one.

**TIP**: If you made a mistake you can re-run all the cells from the top to get your df as it was originally. You can do it with "shift+enter" or using the "Run all above selected cells" in the run menu.

In [None]:
# Code here!


In [None]:
# Replace the old Series with the new one in this cell
# Code here!


# The Series "age_group"

**>>>** Plot the data using a bar graph.

In [None]:
# Code here!


## The methods `.unique()` and `.nunique()`

### `.unique()` 

- The `.unique()` method is used to return an array of all the unique values present in a Series. In other words, it removes duplicate values and provides a list of unique values.

- It returns an array (a NumPy array or a Pandas array, depending on the dtype), not a Series.

### `.nunique()` 

- The `.nunique()` method is used to count the number of distinct (unique) values in a Series.

- It returns the count of unique values.
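
A short sketch on a made-up numeric Series shows the type returned by `.unique()` and how missing values are counted:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1.0, 2.0, 2.0, np.nan])

distinct = numbers.unique()
print(type(distinct).__name__, distinct)
print(numbers.nunique())              # NaN is not counted by default
print(numbers.nunique(dropna=False))  # NaN counted as one more value
```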

In [None]:
fake_df['letter'].unique()

In [None]:
fake_df['letter'].nunique()

## Creating a new Series : "age_mean"

**>>>** Create a new Series called "age_mean" which will take the mean of each age group. There are several ways to do it:

- Use a dictionary and `.map()` (quick and simple way).
- Use `.unique()` to get all the unique strings, convert the elements to integers and compute the mean. You can either use a custom function or a dictionary comprehension. To compute the mean, you can use `np.mean()` (more complicated).

In [None]:
# Code here!


# Parents' level of education

These are ordinal values, but they are stored as strings. Let's make things right.

In [None]:
df['father_degree']

In [None]:
df['mother_degree']

**>>>** Create new Series with integers instead of strings.

- The two new Series will be named "father_el" and "mother_el" ("el" = "education level").
- Create a dictionary and use the `.map()` function to convert the strings to integer. The values will be :
    - If the string is "Do not wish to answer / can't answer", replace it with `None`.
    - If it is "No diploma" then 0.
    - If it is "French baccalaureate" then 1,
    - If it is "Licence" then 2,
    - If it is "Master or PhD" then 3 
    
- At the end create a new Series named "parents_el" which will be the sum of "father_el" and "mother_el".

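- Once you've created "father_el" and "mother_el", convert them to an integer type. Because of the `None` values, a plain `int` conversion fails on missing data; the nullable Pandas dtype `'Int64'` accepts them.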

In [None]:
# Code here!


# The Series "size"

Let's say we want to know the average size of the people who took the survey. We first need to clean those strings and convert them to integers.

In [None]:
df['size']

**>>>** Write a custom function that cleans the Series.

- Apply it using `.map()`
- Replace the old Series with the new one.
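
The general pattern (write a function for one value, then apply it to every value with `.map()`) might look like the sketch below. The input formats ("175 cm", "1m80", "1.68") are assumptions made for illustration, and `clean_size` is a hypothetical helper; the real Series may contain other patterns that need their own rules.

```python
import re
import pandas as pd

def clean_size(raw):
    """Return a size in centimetres as an int, or None if no digits are found."""
    digits = re.findall(r"\d+", str(raw))
    if not digits:
        return None
    if len(digits) == 1:
        return int(digits[0])                        # e.g. "175 cm" -> 175
    # Two groups of digits are read as metres then centimetres, e.g. "1m80"
    return int(digits[0]) * 100 + int(digits[1])

# Hypothetical raw values
sizes = pd.Series(["175 cm", "1m80", "1.68", " 182"])
cleaned = sizes.map(clean_size)
print(cleaned.tolist())
```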

In [None]:
# Code here!


# The Series "gender"

**>>>** Create a new Series called "gender_int", and assign 1 if the gender is "Woman" and 0 if the gender is "Man". Convert this Series to 'int'.

In [None]:
# Code here!


## The function `.str.split()`

This method belongs to the `.str` accessor of a Series. It behaves almost the same way as Python's built-in `str.split()` method, but it is applied to every string in the Series.

In [None]:
fake_df['numbers_list'].str.split()

In [None]:
fake_df['numbers_list'].str.split()[0][0]

In [None]:
fake_df['numbers_list'].str.split('-')[0][0]
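
For comparison with the built-in method, here is a small sketch on a made-up Series (`codes` is hypothetical). It shows two differences worth keeping in mind: the result is one list per row, and missing values stay missing instead of raising an error.

```python
import pandas as pd

# Hypothetical toy Series, not the notebook's fake_df
codes = pd.Series(["1-2-3", "4-5", None])

# Built-in method: works on a single string
print("1-2-3".split("-"))

# Accessor method: works on every row, missing values are left as missing
split_codes = codes.str.split("-")
print(split_codes)

# .str[0] then picks the first piece of each list
first_piece = split_codes.str[0]
print(first_piece)
```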

# The Series "pets"

This Series has an issue: several values are stored in the same string. Let's separate them.

In [None]:
df['pets']

**>>>** To deal with this data, we'll need to: 

- Put all the strings in lower case.
- Find the right separator.
- Split our data into three different Series named "pet1", "pet2" and "pet3". To do this, use the `expand` parameter of `.str.split()`.

**TIP**: You can pass a list of column names to a DataFrame to select several Series at once. For example:

```Python
fake_df[['letter', 'value']]
```
This can be useful to create several new Series in a single line of code.
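
A minimal sketch of that idea on invented data: the column "fruits", the separator ", " and the new column names are all assumptions, and the real "pets" strings may use another separator.

```python
import pandas as pd

# Hypothetical toy data; the real separator in "pets" may differ
toy = pd.DataFrame({"fruits": ["Apple, Pear", "KIWI", "Fig, Plum, Lime"]})

# expand=True returns a DataFrame with one column per piece
pieces = toy["fruits"].str.lower().str.split(", ", expand=True)

# A list of column names on the left-hand side creates all three columns at once
toy[["fruit1", "fruit2", "fruit3"]] = pieces
print(toy)
```

Rows with fewer pieces than columns are padded with missing values.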

In [None]:
# Code here!


In [None]:
# Use this cell to create the 3 Series "pet1", "pet2" and "pet3"
# Code here!


# Correlation

Let's see if we can find any correlation in our dataset.

## Prepare the dataset

First, let's build a version of our df that contains only the variables of interest:

- 'pop_clean'
- 'educ_years'
- 'salary'
- 'siblings'
- 'size'
- 'gender_int'
- 'age_mean'
- 'parents_el'

### The method `.copy(deep=True)`

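Selecting part of a DataFrame can return either a **view** of the original data or a **copy**, depending on the operation and the Pandas version, so modifying the selection may or may not affect the original. To be sure you are working on an independent object, ask Pandas for a distinct copy with `.copy(deep=True)` (`deep=True` is the default).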

In [None]:
new_fake_df = fake_df.copy(deep=True)
new_fake_df['letter'] = "Z"
new_fake_df['letter']

In [None]:
# Original df hasn't been modified
fake_df['letter']
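
The same method can be combined with a column selection. The sketch below uses a made-up DataFrame (`toy`, with hypothetical columns "a", "b" and "c") to show a subset being copied, modified and passed to `.corr()` without touching the original.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": ["w", "x", "y", "z"]})

# Select the columns of interest, then copy so later edits cannot reach `toy`
subset = toy[["a", "b"]].copy()
subset["a"] = subset["a"] * 10

print(toy["a"].tolist())   # the original column is unchanged
print(subset.corr())       # pairwise correlations of the numeric columns
```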

**>>>** Create a deep copy of our DataFrame named "corr_df" which contains only those columns of interest. Then apply the `.corr()` method on it.

In [None]:
# Code here!


## Another way to deal with outliers

We still have some outliers, especially inside the Series "salary". There are countless ways to handle them; here, let's use `np.where()`.

### `np.where()`

This function performs conditional operations on NumPy arrays and Pandas Series: for each element, it picks one value where the condition is true and another where it is false. It returns a new NumPy array, which can be assigned back to a DataFrame to create or replace a Series. The general syntax for `np.where()` is:

```python
np.where(condition, value_if_true, value_if_false)
```

In [None]:
# Example

np.where(fake_df['value'] > 0, 10, 0)

In [None]:
# Example

# np.set_printoptions(precision=2, suppress=True) # useful to better visualize
np.where(fake_df['value'] > 0, 10, fake_df['value'] / 100)

In [None]:
# Creating or replacing a Series

fake_df['new_value'] = np.where(fake_df['value'] > 0, 10, fake_df['value'] / 100)
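
As a further illustration, the sketch below replaces an extreme value with the median of its column. The data and the threshold of 1000 are invented for the example; choosing a sensible threshold for real data is a separate question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with one obvious outlier; the threshold is arbitrary
toy = pd.DataFrame({"amount": [10, 12, 11, 5000, 13]})

median = toy["amount"].median()
result = np.where(toy["amount"] > 1000, median, toy["amount"])

print(type(result).__name__)   # np.where hands back a NumPy array
toy["amount"] = result         # assigning it back makes it a column again
print(toy["amount"].tolist())
```

The median is used rather than the mean because a single extreme value barely moves it.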

**>>>** Replace the two outliers values we have in the "salary" Series with the median of that Series.

In [None]:
# Code here!


In [None]:
# Let's display the new result
corr_df.corr()#.round(2)

# Correlation visualisations

When dealing with data, it's always worth plotting it rather than relying on summary numbers alone.

## Anscombe's quartet

Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. ([wikipedia](https://en.wikipedia.org/wiki/Anscombe%27s_quartet))


<img src="files/anscombe.png" width="70%" align="center">


| Property                                                  | Value             | Accuracy                                |
|-----------------------------------------------------------|-------------------|-----------------------------------------|
| Mean of x:                                                | 9                 | exact                                   |
| Sample variance of x:                                     | 11                | exact                                   |
| Mean of y:                                                | 7.50              | to 2 decimal places                     |
| Sample variance of y:                                     | 4.125             | ±0.003                                  |
| Correlation between x and y:                              | 0.816             | to 3 decimal places                     |
| Linear regression line:                                   | y = 3.00 + 0.500x | to 2 and 3 decimal places, respectively |
| Coefficient of determination of the linear regression: R² | 0.67              | to 2 decimal places                     |
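
The figures in the table can be checked numerically. The sketch below hardcodes the first two of the four data sets (values as listed in the Wikipedia article linked above) and recomputes the statistics with NumPy; the variable names are ours.

```python
import numpy as np

# Anscombe's data sets I and II share the same x values
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

summary = {}
for name, y in {"I": y1, "II": y2}.items():
    summary[name] = {
        "mean_x": x.mean(),
        "var_x": x.var(ddof=1),            # ddof=1 gives the sample variance
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],   # Pearson correlation
    }
    print(name, {key: round(float(value), 3) for key, value in summary[name].items()})
```

Both data sets produce nearly the same numbers, even though one is roughly linear and the other follows a curve.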



## Seaborn and heatmap

Let's use Seaborn, a library built on top of Matplotlib.

In [None]:
import seaborn as sns
sns.heatmap(corr_df.corr());

In [None]:
# Let's annotate the heatmap and make its colour scale range from -1 to 1.
sns.heatmap(corr_df.corr(), vmin=-1, vmax=1, annot=True);

### The `sns.pairplot()` function

This function is used to create a matrix of scatterplots, also known as a pairwise scatter plot matrix. It's a valuable tool for visualizing the relationships between multiple variables (columns or Series) in a DataFrame.


In [None]:
sns.pairplot(corr_df, diag_kind='kde', kind='reg', plot_kws={'color': 'red'});

In [None]:
sns.lmplot(x='parents_el',
           y='salary',
           data=corr_df,
           fit_reg=True,
           line_kws={'color': 'red'}
          );

In [None]:
sns.lmplot(x='gender_int',
           y='salary',
           data=corr_df,
           fit_reg=True,
           line_kws={'color': 'red'}
          );

# Advanced cleaning

Population, colour and TV Show can be cleaned using more complicated techniques.

## Population

In [None]:
df['population']

**>>>** Write a custom function and use `.map()` to clean the column population.

In [None]:
# Code here!


# The Series "sports"

**>>>** Use this [list](https://gist.githubusercontent.com/stefanoverna/371f009900bbe9ceec208f5dd1688737/raw/db7a90fa9e5dcb4ec22f4aef2774348fff7ccf69/gistfile1.txt) found on GitHub and the `SequenceMatcher` class from the `difflib` library to autocorrect the sport names. Name your function "correct".
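
To get a feel for `SequenceMatcher`, here is a minimal sketch of fuzzy matching against a short reference list. The list `KNOWN_SPORTS`, the helper name `closest_match` and the 0.6 threshold are all assumptions for illustration; the exercise itself uses the longer list linked above.

```python
from difflib import SequenceMatcher

# Hypothetical reference list; the exercise uses the longer list linked above
KNOWN_SPORTS = ["football", "basketball", "tennis", "swimming"]

def closest_match(word, candidates, threshold=0.6):
    """Return the candidate most similar to `word`, or `word` itself if none is close enough."""
    cleaned = word.strip().lower()
    best, best_score = word, 0.0
    for candidate in candidates:
        # ratio() is a similarity score between 0 (nothing in common) and 1 (identical)
        score = SequenceMatcher(None, cleaned, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else word

print(closest_match("Footbal", KNOWN_SPORTS))
print(closest_match("tenis", KNOWN_SPORTS))
print(closest_match("chess", KNOWN_SPORTS))
```

The threshold keeps words that resemble nothing in the list unchanged instead of forcing them onto the nearest entry.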

In [None]:
# Code here!


# The Series "colour"

**>>>** Find a list of colours on the internet and, just like the previous question, find a way to correct the data. You can reuse functions you've already written.

In [None]:
df['colour']

In [None]:
# Code here!


# TV Show

### Scraping

**>>>** Use web scraping to build a list of TV shows, then use it to correct the data.

In [None]:
# Code here!
