# Python for Data Science Practice Session 1: Mathematics and Statistics

## A walk through history

In this session, we are going to analyse a dataset of Mathematicians who have helped shape what Maths, as well as many other fields, has become today. This is a great chance for you to learn more about them: where they are from, the fields they were interested in, the occupations they held and much more that you can explore.

We start by importing the libraries that we are going to need for this project:

In [2]:
# Import pandas  


In [3]:
# Import itertools, a library that is used in one of the pre-written code cells below
 

In [4]:
# Load the file named mathematicians.csv into a dataframe


In [5]:
# View the dataframe


You might be wondering what NaN stands for. NaN is an abbreviation for "Not a Number": it is a special floating-point value that represents a result that is undefined or cannot be represented on the computer. Take 0/0 as an example, which is undefined. pandas treats NaNs as missing values.
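To make this concrete, here is a small sketch of how NaN behaves. The values in `toy` are made up for illustration and are not taken from mathematicians.csv. Note that plain Python raises a `ZeroDivisionError` for `0/0` rather than returning NaN, so the example builds a NaN with `float("nan")` instead.

```python
import math

import numpy as np
import pandas as pd

# NaN is an ordinary float value, but it never compares equal to anything, itself included
nan_value = float("nan")
is_float = isinstance(nan_value, float)
equals_itself = nan_value == nan_value
detected_by_math = math.isnan(nan_value)

# pandas treats both NaN and None as missing values
toy = pd.Series([1.5, np.nan, None])  # toy data, not from the real dataset
missing_mask = toy.isna()

print(is_float, equals_itself, detected_by_math)
print(missing_mask.tolist())
```

Because NaN is not equal to itself, checks such as `value == np.nan` do not find missing values; `isna()` is the usual way to test for them in pandas.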

This dataset is a great example of what real-life datasets might look like: they are not perfect, and you will have to find a way to deal with their missing values.

For now, let's get a taste of what the dataset looks like. We'll get back to dealing with missing values in a bit.

In [6]:
# Output the dimensions of the dataframe


In [7]:
# Output the data of 10 random mathematicians


Not all the column headings are visible in the previous output due to the large number of columns. 

Let us output the column headings:

In [8]:
# View the column headings


In [9]:
# Drop the following columns: 'Erdős number' , 'instance of', 'approx. date of birth' and 'approx. date of death'


In [10]:
# Output the number of columns


- - - - 

## Let's talk about missing values

So we have a total of 25 columns. Let's check how many missing values each column has individually.

In [11]:
# Output the total number of missing values per column


 - - - - - -
As you can see, most of the columns have a large number of missing values, which is an issue. After all, an analysis is only as good as the data behind it. In our case, the large number of missing values per column means that our analysis might not be of the best quality.

One problem that we could face is a pattern in the missing values. For example, older Mathematicians might have less information associated with them than more recent ones (I will leave it to you to check whether this is the case here, if you are interested). Such a pattern introduces bias: most of the data used in the analysis would come from recent Mathematicians, so the results would not represent the older ones well.

What we would like to have is a dataset that is free of missing values and covers a broad variety of Mathematicians. 
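If you would like to explore whether missing values follow a pattern, one possible approach is sketched below on a toy dataframe. The column names and values here are hypothetical and only illustrate the idea: count the missing values in each row, then compare the average count across centuries of birth.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from mathematicians.csv)
toy = pd.DataFrame({
    "year of birth": [1600, 1650, 1700, 1900, 1950, 1980],
    "field of work": [np.nan, np.nan, "geometry", "algebra", "analysis", np.nan],
})

# Count the missing values in each row
toy["n_missing"] = toy.isna().sum(axis=1)

# Group the rows by century of birth and average the missing-value counts
toy["century"] = toy["year of birth"] // 100 + 1
missing_by_century = toy.groupby("century")["n_missing"].mean()

print(missing_by_century)
```

A clear trend in such a summary would suggest that the missing values are not spread evenly over time.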
- - - - - - - - - - - - - - -
In the next task, I would like you to pick a random mathematician and fill in their missing values by doing some quick research. 


In [12]:
# Follow these steps: 

# 1) Pick a random mathematician 
# 2) Do a quick research on them
# 3) Fill in their missing data manually in the dataframe







In a real-world scenario, we would not fill in each missing value manually, as it is extremely inefficient.

One solution is Web Scraping, a technique for automatically collecting data from the internet using Python libraries, one of the most popular being Beautiful Soup. It is extremely powerful in the sense that it can collect large amounts of data at the click of a button, saving the massive amount of time needed to collect the data manually.

Even though Web Scraping is beyond the scope of this course, I would recommend reading more about it on your own. You can find more about Web Scraping and Beautiful Soup here: https://realpython.com/beautiful-soup-web-scraper-python/ 

Keep in mind that there are different types of missing values (e.g. numeric missing values) and different ways of dealing with them. We will return to this topic in the upcoming sessions.
- - - - -



## List of occupations

In [13]:
# View the occupation column


As you can see, many of the Mathematicians had more than one occupation. Let us analyse the different occupations in the dataset.

In [14]:
# View the unique entries in the occupation column, then save it in a list called unique_occ


In [15]:
# View unique_occ


The occupation "mathematician" appears in both ['mathematician'] and ['publisher', 'mathematician'] in unique_occ. This is because the two entries are different as a whole, so they are treated as distinct values. As a result, the <b>unique()</b> function alone cannot give us a list of the unique occupations. 
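The effect can be reproduced on a toy column. This sketch assumes that each entry is stored as a single string, which is how the entries above appear; the values are made up for illustration.

```python
import pandas as pd

# Toy column: every entry is one string, even when it lists several occupations
toy_occupation = pd.Series([
    "['mathematician']",
    "['publisher', 'mathematician']",
    "['mathematician']",
])

# unique() compares whole entries, so it cannot see the occupations inside them
toy_unique = toy_occupation.unique()

print(len(toy_unique))
print(list(toy_unique))
```

Only the exact duplicate is dropped; "mathematician" still appears in both remaining entries.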

In the following tasks, we will work towards outputting a list of the unique occupations.

In [14]:
# This code creates a list called flat_list: it takes each entry in unique_occ, splits it on commas so that
# every occupation is separated from the others, and puts each occupation in its own entry
item_list = []
for item in unique_occ:
    # Skip non-string entries such as NaN, which have no split method
    if isinstance(item, str):
        item_list.append(item.split(","))
# Flatten the list of lists into a single list
flat_list = list(itertools.chain.from_iterable(item_list))
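To see what the splitting and flattening steps do, here is the same idea applied to two made-up entries. The names `toy_entries`, `pieces` and `toy_flat` are only used for this illustration.

```python
import itertools

# Two made-up entries in the same string format as the occupation column
toy_entries = ["['mathematician']", "['publisher', 'mathematician']"]

# Splitting on commas turns each entry into a list of pieces
pieces = [entry.split(",") for entry in toy_entries]

# chain.from_iterable joins the inner lists into one flat list
toy_flat = list(itertools.chain.from_iterable(pieces))

print(pieces)
print(toy_flat)
```

Notice that the brackets, quotes and leading spaces are still attached to the pieces after the split.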

In [16]:
# Print the first 30 entries of flat_list


In [31]:
# Compare the length unique_occ with flat_list 
print("Length of unique_occ is", '_____' , "while length of flat_list is", '_____' )

Length of unique_occ is _____ while length of flat_list is _____


As you can see in the first 30 entries of flat_list, the format still has a few issues. For example, we have 'mathematician', ['mathematician'] and 'mathematician'], which all represent the same occupation but are written in different ways. Python will treat them as distinct entries when we try to filter out the distinct values from the list.

To solve this, we will loop over the entries in flat_list and use a function called <b>replace</b> to remove unwanted characters so that all of the entries have a unified format. Read more about how to use it here: https://www.w3schools.com/python/ref_string_replace.asp 

I would recommend solving this issue one character at a time: remove one unwanted character, check what the output looks like, identify the next character that needs to be removed, and repeat until all the entries share a unified format. This makes the process much simpler. 

One thing to keep in mind is that Python strings are sensitive to case and spacing. So 'Mathematician' isn't the same as 'mathematician', and '[' is not the same as ' ['.


<b> HINT:</b> Writing '' (an empty string, with nothing between the quotes) as the second parameter of the function <b>replace</b> will remove the character instead of replacing it with another character.
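As a small illustration on a single made-up string, the sketch below removes one unwanted character at a time by passing an empty string as the second argument of `replace`. It also shows one thing to watch out for: removing every space merges multi-word occupations, so `strip()`, which only removes whitespace at the two ends, is shown as one possible alternative for the leading space.

```python
# A made-up entry with a leading space, quotes and a closing bracket
raw = " 'computer scientist']"

# Remove one unwanted character at a time
no_bracket = raw.replace("]", "")
no_quotes = no_bracket.replace("'", "")

# Removing all spaces also removes the space inside the occupation
spaces_removed = no_quotes.replace(" ", "")

# strip() only removes whitespace at the start and end of the string
cleaned = no_quotes.strip()

print(repr(no_bracket))
print(repr(no_quotes))
print(repr(spaces_removed))
print(repr(cleaned))
```

Printing with `repr` makes leading and trailing spaces visible, which helps when checking the output after each step.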

In [19]:
# Loop through flat_list and use replace to remove unwanted characters, then save every new entry
# in the new list called flat_list2


In [20]:
# Print out the first 30 entries of flat_list2


Problem solved! Now use the function called <b>set</b> to retrieve the unique occupations from flat_list2. Read more about how to use it here: https://www.geeksforgeeks.org/python-set-method/
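Here is a brief sketch of how `set` behaves on a made-up list. A set keeps one copy of each value and has no guaranteed order, so the example sorts the result to get a stable list.

```python
# A made-up list of cleaned occupations with one repeated value
toy_clean = ["mathematician", "publisher", "mathematician", "astronomer"]

# A set keeps only one copy of each value
toy_set = set(toy_clean)

# Sets are unordered, so sort the values to get a predictable list
toy_sorted = sorted(toy_set)

print(len(toy_set))
print(toy_sorted)
```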

In [21]:
# Retrieve the unique occupations from flat_list2


In [22]:
# Print out the length of the list of unique occupations


In [23]:
# Print out the list of unique occupations


Apart from some unusual entries (which we will ignore for the sake of time), the list of unique occupations is broad and varied, which shows how versatile Mathematicians are.
- - - - -

## Analysing gender distribution

Now let's take a look at the difference between the number of male and female mathematicians:

In [24]:
# Print out the unique entries in the "sex or gender" column


It seems to produce some weird entries like ['male', 'Swedish Wikipedia', 'Virtual International Authority File', 'Italian Wikipedia'] containing info that should not be in the gender column. This is definitely not the format that we want the gender column to have.

In the following tasks, we are going to modify the dataset to match what we need.

Looking closely at the unique entries in "sex or gender", each one is either <b>NaN</b> or contains one of the words <b>male</b>, <b>female</b> or <b>intersex</b>, sometimes alongside other unwanted text. This makes it easy to change every entry in the column so that it is either "male", "female", "intersex" or "not specified". 
 
We use a function called <b>str.contains</b>, which checks whether each string entry in a column contains a specific word (or, more generally, a pattern of characters) and outputs a column of True/False values, where True means the word is present in the entry and False means it is absent. Be careful: the word "female" itself contains "male", so a plain search for "male" also matches the female entries.

Read more about it here: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html and use it in order to create four new boolean columns in the dataframe: "male", "female", "intersex" and "not specified".
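The sketch below applies `str.contains` to a made-up column in the same string format as the entries shown earlier. It compares a plain search for `male` with a search that includes the surrounding quote characters; `regex=False` asks pandas to treat the search text literally.

```python
import pandas as pd

# Made-up entries in the same string format as the "sex or gender" column
toy_gender = pd.Series([
    "['male']",
    "['female']",
    "['male', 'Swedish Wikipedia']",
    "",
])

# A plain search for male also matches female
naive_male = toy_gender.str.contains("male", regex=False)

# Including the quote characters only matches the exact word
quoted_male = toy_gender.str.contains("'male'", regex=False)

print(naive_male.tolist())
print(quoted_male.tolist())

# Summing a boolean column counts the True values
print(quoted_male.sum())
```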

In [23]:
# Create the four new columns specified above in the dataframe
df['sex or gender'] = df['sex or gender'].fillna('')  # This replaces the NaN values with empty strings so that str.contains can be applied

In [25]:
# Count the number of males


In [26]:
# Count the number of females


In [27]:
# Count the number of intersex


In [28]:
# Count the number of not specified


In [30]:
# Calculate the proportions of male and female mathematicians in the dataset
print('The proportions are:')
print('Male =', '_____')
print('Female =', '_____' )

The proportions are:
Male = _____
Female = _____


Now that is a big difference between the number of male and female Mathematicians. This is a clear indication that the data is biased towards male Mathematicians. 

Do you think that this might be a problem when using the data for predictions and drawing conclusions? Have a think about it.

We now create a new column called gender that has one of the following values: male, female, intersex or not specified.

In [29]:
# This code creates a list called gender with the format that we want.
# The quotes inside the search strings make sure that 'male' does not match 'female'.
gender = []
for entry in df['sex or gender'].astype(str):
    if "'male'" in entry:
        gender.append('male')
    elif "'intersex'" in entry:
        gender.append('intersex')
    elif "'female'" in entry:
        gender.append('female')
    else:
        gender.append('not specified')
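As an aside, the same labelling can be done without an explicit loop. The sketch below is one possible alternative using `numpy.select` on a made-up column; it is not part of the original exercise, and the name `toy_labels` is only used for illustration.

```python
import numpy as np
import pandas as pd

# Made-up entries in the same string format as the "sex or gender" column
toy = pd.Series(["['male']", "['female', 'Italian Wikipedia']", "['intersex']", ""])

# One boolean array per label, checked in the same order as the if/elif chain
conditions = [
    toy.str.contains("'male'", regex=False).to_numpy(),
    toy.str.contains("'intersex'", regex=False).to_numpy(),
    toy.str.contains("'female'", regex=False).to_numpy(),
]
labels = ["male", "intersex", "female"]

# Entries that match none of the conditions get the default label
toy_labels = np.select(conditions, labels, default="not specified")

print(toy_labels.tolist())
```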

In [30]:
# Replace the 'sex or gender' column with 'gender'


In [32]:
# Preview the 'sex or gender' column


All solved! Now that we have formatted the 'sex or gender' column, we can get rid of the boolean columns 'male', 'female', 'intersex' and 'not specified'.

In [33]:
# Remove the following columns: 'male', 'female', 'intersex' and 'not specified'


## Analysing the dataset

In [34]:
# Are there any rows that do not have any missing values? (Output True or False)


In [35]:
# Output the rows that have at least 10 missing values


In [36]:
# Output the rows of the mathematicians that have the year of birth filled in, name the list of rows x


Some of the following tasks require a function called <b>str.contains</b>, which is explained in the section named <b>Analysing gender distribution</b> in this notebook. 

In [37]:
# Check the datatype of the 'year of birth' column


We would like the entries of the 'year of birth' column to be numeric so that we can perform the mathematical operations needed in the following tasks.

In the next task, convert the data type of the column 'year of birth' to numeric. Do you run into a problem? Inspect the column entries to find out what is causing it and fix it. 

Note: You can either edit the entries that are causing the problem, or remove them entirely.
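The sketch below shows, on made-up entries, one way such a problem can show up and be located with `pd.to_numeric`. The problematic value `"c. 1850"` is hypothetical; the entries causing trouble in the real column may look different.

```python
import pandas as pd

# Made-up entries: two clean years, one hypothetical problem value, one missing value
toy_years = pd.Series(["1707", "1777", "c. 1850", None])

# A strict conversion fails as soon as one entry cannot be parsed
try:
    pd.to_numeric(toy_years)
    strict_conversion_failed = False
except (ValueError, TypeError):
    strict_conversion_failed = True

# errors="coerce" turns entries that cannot be parsed into NaN instead of failing
converted = pd.to_numeric(toy_years, errors="coerce")

# Entries that became NaN but were not missing to begin with are the culprits
problem_entries = toy_years[converted.isna() & toy_years.notna()]

print(strict_conversion_failed)
print(problem_entries.tolist())
print(converted.tolist())
```

Once the culprits are known, you can decide whether to edit them by hand or drop them, as suggested in the note above.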

In [38]:
# Convert the datatype of the column 'year of birth' to numeric


In [39]:
# Output the row of the earliest-born mathematician in x


In [40]:
# Output the row of the most recently born mathematician in x


In [41]:
# Output the range of the year of birth column


In [42]:
# Output the rows of the mathematicians that have studied at the University of Cambridge


In [43]:
# How many mathematicians were interested in any of the fields of Analysis?


## Extra task

Here is a link for you to check out that shows the investigations that the owner of this dataset performed: https://www.kaggle.com/joephilleo/investigating-the-mathematicians-of-wikipedia