# Fundamentals of Data Analysis Tasks

## Rebecca Feeley

*** 

## Task 1
> To verify, using Python, that the Collatz conjecture is true for the first 10,000 positive integers.

The Collatz Conjecture is a mathematical problem which remains unsolved. It is based on the function below:

\begin{equation*}
f(x) =
\begin{cases}
x \div 2 & \text{if } x \equiv 0 \pmod{2} \text{ (x is even)},\\
3x+1 & \text{if } x \equiv 1 \pmod{2} \text{ (x is odd)}.
\end{cases}
\end{equation*}

In simpler terms, the process is as follows: pick any positive integer; if it is even, divide it by 2; if it is odd, multiply it by 3 and add 1. Repeat this process with the resulting number. The Collatz Conjecture states that no matter what positive integer is used, the sequence will always eventually reach 1.

The task here is to verify that the Collatz Conjecture holds true for the first 10,000 positive integers.


**The first set of code is to demonstrate the Collatz Conjecture itself by asking the user to input a positive integer and demonstrating how the conjecture will eventually reach 1 for this number.**

In [None]:
# Asks the user to input a positive integer and outputs successive values of the following calculation:
# if even, divide by 2; if odd, multiply by 3 and add 1

number = int(input("Enter any positive integer: ")) # Asks user to input a number and converts it to an integer
collatz_numbers = [] # a list is created to which each number will be added

while number <= 0: # While loop is created to catch any negative numbers inputted so program runs properly
    print ("The number inputted must be a positive integer.")
    number = int(input("Please enter a positive integer: "))

def collatz(number): # I created a function to carry out the calculations required by the Collatz Conjecture
    if (number % 2) == 0: # the number entered is checked using modulus to see if it is even. If it is even, 
        number = (number // 2) # this code runs and the number is divided by 2 and added to collatz_numbers list
    elif (number % 2) == 1: # else if the number entered is odd,
        number = (number * 3 + 1) # this code runs and the number is multiplied by 3 and 1 is added. 
    return (number)

while number != 1: # this while loop runs until the number equals 1
    collatz_numbers.append(number) # the current number is appended to the collatz_numbers list
    number = collatz(number) # the next number in the sequence is calculated by the collatz function
collatz_numbers.append(number) # the final value, 1, is appended to the list
print(*collatz_numbers) # the '*' unpacks the collatz_numbers list into separate arguments, which are printed



### The above code demonstrates the Collatz Conjecture in action to the user. Now, I will verify the Collatz Conjecture for the first 10,000 positive integers. 

In [None]:
def collatz(number): # I created a function to carry out the calculations required by the Collatz Conjecture
    if (number % 2) == 0: # the number entered is checked using modulus to see if it is even. If it is even, 
        number = (number // 2) # this code runs and the number is divided by 2 and added to collatz_numbers list
    elif (number % 2) == 1: # else if the number entered is odd,
        number = (number * 3 + 1) # this code runs and the number is multiplied by 3 and 1 is added. 
    return (number)

# Earlier attempt, fully commented out so that the cell runs without an IndentationError:
# collatz_numbers = []
# for i in range(1, 10001):
#     number = i
#     while number != 1:
#         collatz_numbers.append(number)
#         number = collatz(number)
#     collatz_numbers.append(number)
# print(*collatz_numbers)


collatz_dict = {} # here I am creating a dictionary to store the values
for i in range(1, 10001):
    number = i
    collatz_sequence = []
    while number != 1:
        collatz_sequence.append(number)
        number = collatz(number)
    collatz_sequence.append(number)
    collatz_dict[i] = collatz_sequence

for key, value in collatz_dict.items():
    print(f"Starting Number: {key}, Sequence: {value}") # shows every value down to 1, verifying the conjecture for these 10,000 starting numbers (this is not a general proof)
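
Printing every sequence shows the result but does not check it programmatically. The following is a minimal sketch of an explicit check. The names `collatz_step` and `reaches_one` and the `max_steps` cap are assumptions introduced here for illustration; they are not part of the original notebook.

```python
def collatz_step(number):
    """Apply one Collatz step: halve even numbers, map odd n to 3n + 1."""
    if number % 2 == 0:
        return number // 2
    return 3 * number + 1


def reaches_one(start, max_steps=10_000):
    """Return True if the sequence from `start` hits 1 within `max_steps` steps.

    `max_steps` is an arbitrary safety cap (an assumption), so the loop
    could not run forever even if a counterexample existed.
    """
    number = start
    for _ in range(max_steps):
        if number == 1:
            return True
        number = collatz_step(number)
    return number == 1


# Collect any starting values that fail the check; the list should be empty.
failures = [n for n in range(1, 10001) if not reaches_one(n)]
print("Verified for 1 to 10,000" if not failures else f"Failed for: {failures}")
```

An empty `failures` list means every starting value from 1 to 10,000 reached 1 within the step cap.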

The End of Task 1

***

## Task 2

> Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale. 

![Penguins](https://www.gabemednick.com/post/penguin/featured.png) 


The Palmer Penguins Data Set was provided by Dr Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research (LTER) Program. It is often presented as an alternative to the famous Iris Data Set.
Dr Gorman and the LTER Program studied 344 adult penguins of 3 species: Adélie, Chinstrap and Gentoo penguins.
These 344 penguins were studied across three islands in the Palmer Archipelago near Palmer Station, Antarctica.
Dr Gorman and the LTER Program collected various types of information on the penguins, which we will see below.


### The different types of data

Data can be separated into 2 primary classes: quantitative data and qualitative data. 

- Qualitative data is also known as categorical data. Categorical data is data which cannot be easily counted or measured using numbers. They hold no numerical value through which they can be ranked. Rather, the variables can only be sorted by category. Examples include car brands, gender, and hair colour. 

- Quantitative data is also known as numerical data. This is data which can be quantified, i.e. it has an intrinsic numerical value through which it can be measured. Numerical data can therefore be expressed in numbers and ranked. 
Examples include height, speed and temperature.

The above 2 classifications can be further separated into 4 sub-types of data: nominal, ordinal, discrete and continuous data.

_Nominal_ data is a subset of qualitative data. It contains variables which do not have an intrinsic value through which they can be ranked or ordered. They are not quantifiable and cannot be measured numerically. 
For example, hair colour does not naturally have a numerical value through which it can be ranked.

_Ordinal_ data is also a subset of qualitative data. It contains variables which are in an order (or sequence) due to the relation amongst the different categories, i.e. they have a natural ordering on a scale but still maintain their class of value. This differentiates it from nominal data.
However, ordinal data does not have a numerical value, so while it can be assigned a number to show its position, it cannot be used to do arithmetic. An example is the top three finishers in a marathon.

_Discrete_ data is a subset of quantitative data. It contains variables which are numerical, but only includes integers which have a limited possible amount of values. They cannot be further divided into smaller parts. 
A clear example of this is the days of the month. 

_Continuous_ data is also a subset of quantitative data. It contains variables which can be broken down into ever smaller values. Unlike discrete data, it is not restricted to integers. An example is the weight of humans, which can be broken down to its fractional value.
Continuous data can be further subdivided into interval and ratio data. 
Interval data is data which can be categorised, ordered and evenly spaced (even intervals) but does not have a natural zero point.
Ratio data is data which can be categorised, ordered, evenly spaced and also have a natural zero point.

### _Description of the columns/data in the data set_ 
The Palmer Penguins data set contains 344 rows and 7 columns. The variables in the columns of the data set are as follows:

Species: The species refers to one of three penguin species studied (Adelie, Gentoo or Chinstrap).   
Island: This refers to the island upon which the penguin was observed (Biscoe, Dream or Torgersen).   
Bill Length (in mm): The length of the penguin's bill in millimeters.  
Bill Depth (in mm): The depth of the penguin's bill in millimeters.  
Flipper Length (in mm): The length of the penguin's flipper in millimeters.  
Body Mass (in grams): The body mass of the penguins in grams.  
Sex: This refers to the sex of the penguin (either male or female).  

### _Classifying each of the variables in the Penguins Data Set_

The variables in the data set can be separated into two primary classifications: categorical (or qualitative) and numerical (or quantitative).

The variables Species, Island and Sex are all categorical nominal data types. For all three of these variables, there is no intrinsic numerical value or method of ordering them. For example, the species types Adelie, Gentoo and Chinstrap cannot be ordered as there is no value through which to order them.

The variables Bill Length, Bill Depth, Flipper Length and Body Mass are all numerical continuous data types. These variables are not restricted to integer values and can take any value, including fractional values.

Furthermore, all of the continuous variables have a natural zero point, i.e. none of these measurements can be less than 0, so these variables are also ratio data.


In [None]:
import pandas as pd
import numpy as np

Penguins = pd.read_csv('penguins.csv')
Penguins.describe()

Penguins.info()

The above code describes the 3 categorical data types as objects; it describes the 4 numerical data types as floats.

### _What variable types should be used to model these variables in Python_

I believe that it is best for the species and island variables to be stored in Python as strings, as these data types cannot be used to complete arithmetic. While some may propose assigning numerical values to each category, such as 1 for Biscoe and 2 for Dream, I would not recommend this approach as these assigned numerical values may get confused with actual numerical data with which arithmetic can be performed. It also increases the likelihood of an error in the Python code, as the values assigned to each category must be consistent throughout the code. While the Sex variable differs slightly in that there are only 2 possibilities (i.e. it is dichotomous) and so it could be stored as a bool data type, I think that it is better to keep a more uniform approach to the nominal data and so it should be stored as a string. This will help prevent errors in the code.

The Bill Length, Bill Depth, Flipper Length and Body Mass variables should be stored as floats in Python as they are continuous quantitative data and so their measurements will need to be very precise, to the exact millimeter. While this could lead to rounding errors, it is important that these variables are stored as floating point numbers to allow arithmetic and analysis to be performed on them.

While the Sex variable could be stored as a string like the other nominal data, I think that the more appropriate data type would be the bool data type in Python.
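
The type choices discussed above can be sketched with a small data class. This is only an illustration: the class name `Penguin` and the record values are hypothetical, and it uses the string representation for all three nominal variables (Sex could equally be declared as a `bool` field).

```python
from dataclasses import dataclass


@dataclass
class Penguin:
    # Nominal variables: stored as strings, since arithmetic on them is meaningless.
    species: str
    island: str
    sex: str
    # Continuous ratio variables: stored as floats to allow arithmetic.
    bill_length_mm: float
    bill_depth_mm: float
    flipper_length_mm: float
    body_mass_g: float


# Hypothetical record: the values below are made up for illustration only.
example = Penguin("Adelie", "Biscoe", "female", 38.5, 18.0, 190.0, 3500.0)

# Floats support arithmetic, e.g. converting body mass from grams to kilograms.
body_mass_kg = example.body_mass_g / 1000
print(example.species, body_mass_kg)
```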

The End of Task 2 

***

## Task 3

> For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable. 

The appropriate distribution depends on whether a variable is categorical or numerical.
Firstly, I will look at the categorical variables.

The qualitative variables Species, Island and Sex are discrete, distinct variables. Such variables are not continuous in nature and are not numerical. As such, they do not fit the numpy distributions which apply to continuous data, such as the normal distribution or the gamma distribution. Therefore, the appropriate numpy probability distribution would be a customised categorical distribution, i.e. numpy.random.choice. 
This function generates a random sample from a given 1-D array and allows us to specify the probability of each category.
This can be obtained using the value counts function as I have demonstrated below.
Such categorical variables are best visualised using bar charts or pie charts.

In [None]:
# importing all of the relevant libraries 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

In [None]:
Penguins = pd.read_csv('penguins.csv') # reading the penguins data set to use (should already be done)
Penguins.describe(include = 'all') # I specified to include all as the describe function defaults to only show numerical data while 
# I wanted to include both numerical & categorical data.

# I have downloaded the csv data file to the directory. It is also available online at https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv

In [None]:
# Species is a categorical variable 
plt.figure(figsize = (10, 7))
sns.countplot (x = 'species', data = Penguins)
plt.title ('Frequency of Penguin Species')
plt.xlabel ('Species')
plt.ylabel ('Count')
plt.show()

prob_species = Penguins['species'].value_counts(normalize=True) # getting the probability of each penguin species

Species = ['Adelie', 'Gentoo', 'Chinstrap']
species_probabilities = [0.442, 0.360, 0.198] # probabilities for each species as set out by above prob_species value counts function

fig = plt.figure(figsize =(10, 7)) #setting the size of the figure
plt.pie(species_probabilities, labels = Species, autopct='%1.1f%%') # double percentage sign is required to show % sign
# using auto percentage function
plt.legend(title = "Species") # setting legend
plt.title('Penguin Species Distribution') # setting title
plt.show()


The resulting probabilities can then be used in the numpy.random.choice function, where the probability of each category is specified, e.g. Adelie is 0.442. The probabilities must add up to 1 for this function to work.
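
As a minimal sketch of that step, the rounded species probabilities listed above can be passed to `numpy.random.choice` through its `p` parameter. The seed value, the sample size of 344 and the variable name `simulated_species` are assumptions chosen for illustration.

```python
import numpy as np

# Species labels and rounded probabilities as listed in the notebook.
species = ["Adelie", "Gentoo", "Chinstrap"]
species_probabilities = [0.442, 0.360, 0.198]

# Seeding makes the sample reproducible; the seed value is arbitrary.
np.random.seed(42)

# Draw 344 synthetic species labels, matching the size of the real data set.
simulated_species = np.random.choice(species, size=344, p=species_probabilities)

# Compare the simulated proportions with the probabilities supplied.
labels, counts = np.unique(simulated_species, return_counts=True)
for label, count in zip(labels, counts):
    print(f"{label}: {count / simulated_species.size:.3f}")
```

The simulated proportions should be close to, but not exactly equal to, the supplied probabilities, since each label is drawn at random.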

In [None]:
# Island is a categorical variable
colors = ['red', 'yellow', 'indigo'] # defining a list of colors
plt.figure(figsize=(10, 7))
sns.countplot(x='island', data = Penguins, palette = colors) # using palette function to set colors of count plot
plt.title('Count of Penguins on Each Island')
plt.xlabel('Island')
plt.ylabel('Count')
plt.show()

prob_island = Penguins['island'].value_counts(normalize=True) # getting the probability of each island


Islands = ['Torgersen', 'Biscoe', 'Dream']
islands_probabilities = [0.15, 0.49, 0.36] # probabilities for each island as set out by above prob_island value counts function


fig = plt.figure(figsize =(10, 7)) #setting the size of the figure
plt.pie(islands_probabilities, labels = Islands, autopct='%1.1f%%', colors = colors) # double percentage sign is required to show % sign
# using auto percentage function; setting same colours as above countplot for island category
plt.legend(title = "Islands") # setting legend
plt.title('Island Distribution') # setting title
plt.show()

In [None]:
# Sex is a categorical variable
colors = ['blue', 'pink']
plt.figure(figsize=(10, 7))
sns.countplot(x='sex', data= Penguins, palette = colors)
plt.title('Count of Penguins by Sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.show()

prob_gender = Penguins['sex'].value_counts(normalize=True) # getting the probability of the penguin gender


Genders = ['Male', 'Female']
gender_probabilities = [0.505, 0.495] # probabilities for gender as set out by above prob_gender value counts function

fig = plt.figure(figsize =(10, 7)) #setting the size of the figure
plt.pie(gender_probabilities, labels = Genders, autopct='%1.1f%%', colors = colors) # double percentage sign is required to show % sign
# using auto percentage function; setting same colours as above countplot for sex category
plt.legend(title = "Genders") # setting legend
plt.title('Gender Distribution') # setting title
plt.show()

Now, I will examine which are the most appropriate probability distributions to model the quantitative variables, flipper length, bill length, bill depth and body mass of the penguins.


The end of Task 3

***

## Task 4

> Suppose you are flipping two coins, each with a probability p of giving heads. Plot the entropy of the total number of heads versus p.


### What is Entropy
The concept of entropy as it applies to information was first introduced by the famous computer scientist and mathematician Claude Shannon in his 1948 paper, "A Mathematical Theory of Communication".
Essentially, entropy can be considered a measure of uncertainty. For a random variable, it is considered the degree of uncertainty as to the possible outcomes of the variable.

Information entropy as expressed by Claude Shannon can be demonstrated in the following equation:

\begin{equation*}
H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)
\end{equation*}
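
The formula can be evaluated directly for a small distribution. The sketch below is an illustration only: the function name `shannon_entropy` is an assumption, and outcomes with probability 0 are skipped because their contribution to the sum is taken to be 0.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with probability 0 are skipped; their contribution is taken as 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(shannon_entropy([0.5, 0.5]))   # a fair coin
print(shannon_entropy([0.25] * 4))   # four equally likely outcomes
```

A fair coin gives 1 bit, and four equally likely outcomes give 2 bits.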

### Entropy & Coin Toss
As we have established above, entropy can be considered a measure of uncertainty. 

When looking at one fair coin toss, the probability of the outcome being heads is as follows:
Probability = (Number of desired results) / (Number of possible results).
The number of desired results is 1 (heads) and the number of possible results is 2 (heads or tails), 
so the probability of getting heads in a single fair coin toss is 1/2, or 0.5.
In this instance, the entropy of the single coin toss is one bit. Any subsequent coin tosses do not deliver any information which impacts the probability of further coin tosses as they are independent events, and getting either outcome remains equally likely.

However, when flipping two fair coins there are 4 possible outcomes and for each specific outcome there is one desired result (1/4):
HH (Heads Heads) -> Probability is 0.25
TT (Tails Tails) -> Probability is 0.25
HT (Heads Tails) -> Probability is 0.25
TH (Tails Heads) -> Probability is 0.25

As the task above specifies the chance of getting heads, the number of desired outcomes is 3 (HH, HT and TH) as they all have at least one heads outcome. There are 4 possible outcomes. So, the probability of getting at least one head is 3/4 or 0.75 when flipping two fair coins (i.e. 0.25 + 0.25 + 0.25 = 0.75).

In [None]:
import numpy as np # importing the necessary library
import matplotlib.pyplot as plt

def H(p): # defining the function for entropy using numpy
    return -(1-p) * np.log2(1.0-p) - p * np.log2(p) # version in numpy of above formula for entropy


In [None]:
def Heads(pHeads): # defining a function for the entropy of a coin toss with the probability p for heads
    totalEntropy = -(1-pHeads) * np.log2(1-pHeads) - pHeads * np.log2(pHeads) # using the above formula inputting the appropriate values
    return totalEntropy

fig, ax = plt.subplots(figsize=(10,7)) # setting the size of the plot and axis

pValues = np.linspace(0.00000001, 0.99999999, 1001) # cannot use 0 here as the logarithm of 0 is undefined so this stops us getting an error

# Calculate entropy for each probability of getting heads
entropies = [Heads(p) for p in pValues]

ax.plot(pValues, entropies, label = 'Entropy')
plt.grid(True) # adding a grid to the plot by setting it as true
ax.set_title('Entropy of a Coin Toss versus Probability of Heads') # setting the title
ax.set_xlabel('Probability of Heads (p)') # setting the label of each axis
ax.set_ylabel('Total Entropy (in bits)')
plt.legend() # adding a legend
plt.show()

The above plot shows how entropy changes with the probability. The higher the entropy, the higher the level of uncertainty. The closer the probability gets to certainty, e.g. when the probability equals 0 or 1, the lower the entropy.
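
The `Heads` function above computes the entropy of a single toss. For the total number of heads in two independent tosses, the count can be 0, 1 or 2, with probabilities (1-p)^2, 2p(1-p) and p^2 respectively. The following is a minimal sketch of that calculation; the function names `heads_distribution` and `entropy_of_heads` and the grid of 101 values of p are assumptions made for illustration.

```python
import math


def heads_distribution(p):
    """Probabilities of 0, 1 and 2 heads from two independent tosses."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def entropy_of_heads(p):
    """Entropy in bits of the total number of heads for heads-probability p."""
    # Outcomes with probability 0 are skipped; their contribution is taken as 0.
    return -sum(q * math.log2(q) for q in heads_distribution(p) if q > 0)


# Evaluate on a grid of p values from 0 to 1 inclusive.
p_values = [i / 100 for i in range(101)]
entropies = [entropy_of_heads(p) for p in p_values]

# Report the largest entropy on the grid and the p at which it occurs.
peak = max(entropies)
print(peak, p_values[entropies.index(peak)])
```

The lists `p_values` and `entropies` could be passed to `ax.plot` in the same way as the single-toss values. On this grid the entropy peaks at p = 0.5, where the distribution of heads is (0.25, 0.5, 0.25) and the entropy is 1.5 bits.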


The end of Task 4

***

## Task 5

> Create an appropriate individual plot for each of the variables in the penguin data set.

In [None]:
Penguins = pd.read_csv('penguins.csv') # reading the penguins data set to use (should already be done)
Penguins.describe(include = 'all') # I specified to include all as the describe function defaults to only show numerical data while 
# I wanted to include both numerical & categorical data.


Penguins.isnull().sum() # checking for any missing values

# There are missing values in bill length, flipper length, bill depth, sex and body mass. 

penguins_complete = Penguins.copy().fillna(Penguins.mode(0).iloc[0]) # I have used the mode() function to replace the missing values with the most frequent value in each column.
# iloc[0] selects the first row of modes, which holds a value for every column; later rows can contain NaN
# setting the data set as a new variable called penguins_complete
penguins_complete.isnull().sum() # checking that the missing values have been filled

Firstly, I have created individual plots for the qualitative data, sex (gender), species and island.

## 1. Gender

In [None]:
colors = ['blue', 'pink'] # setting the colour palette for the count plot
plt.figure(figsize=(10, 7))
sns.countplot(x='sex', data= penguins_complete, palette = colors)
font1 = {'family':'serif','color':'black','size':20}
plt.title('Count of Penguins by Sex', fontdict = font1)
plt.xlabel('Sex')
plt.ylabel('Count')
plt.show()

## 2. Species

In [None]:
prob_species = penguins_complete['species'].value_counts(normalize=True) # getting the probability of each penguin species
Species = ['Adelie', 'Gentoo', 'Chinstrap']
species_probabilities = [0.442, 0.360, 0.198] # probabilities for each species as set out by above prob_species value counts function

fig = plt.figure(figsize =(10, 7)) #setting the size of the figure
explode = (0.0, 0.1, 0)
font1 = {'family':'serif','color':'black','size':20}
plt.pie(species_probabilities, explode = explode, labels = Species, autopct='%1.1f%%') # double percentage sign is required to show % sign
# using auto percentage function
plt.legend(title = "Species") # setting legend
plt.title('Penguin Species Distribution', fontdict = font1) # setting title
plt.show()


## 3. Islands

In [None]:
colors = ['red', 'yellow', 'indigo'] # defining a list of colors
plt.figure(figsize=(10, 7))
sns.countplot(x='island', data = Penguins, palette = colors) # using palette function to set colors of count plot
font1 = {'family':'serif','color':'black','size':20}
plt.title('Count of Penguins on Each Island', fontdict = font1)
plt.xlabel('Island')
plt.ylabel('Count')
plt.show()


Now, I will look at the quantitative data: bill length, bill depth, flipper length and body mass.

## 4. Bill Length (in mm)

In [None]:
sns.histplot(data = penguins_complete, x = "bill_length_mm")
font1 = {'family':'serif','color':'black','size':20}
plt.title('Bill Length of Penguins (in mm)', fontdict = font1)
plt.xlabel('Bill length')
plt.ylabel('Count')
plt.show()

## 5. Bill Depth (in mm)

In [None]:
sns.histplot(data = penguins_complete, x = "bill_depth_mm", kde = True) # setting kde = true to show distribution curve
font1 = {'family':'serif','color':'black','size':20}
plt.title('Bill Depth of Penguins (in mm)', fontdict = font1)
plt.xlabel('Bill Depth')
plt.ylabel('Count')
plt.show()

## 6. Flipper Length (in mm)

In [None]:
sns.histplot(data = penguins_complete, x = "flipper_length_mm", kde = True)
font1 = {'family':'serif','color':'black','size':20}
plt.title('Flipper Length of Penguins (in mm)', fontdict = font1)
plt.xlabel('Flipper Length')
plt.ylabel('Count')
plt.show()

## 7. Body Mass (in grams)

In [None]:
sns.boxplot(data = penguins_complete, x = 'body_mass_g')

In [None]:
sns.histplot(data = penguins_complete, x = "body_mass_g", kde = True)
font1 = {'family':'serif','color':'black','size':20}
plt.title('Body Mass of Penguins (in grams)', fontdict = font1)
plt.xlabel('Body Mass')
plt.ylabel('Count')
plt.show()

***

## Bill Length by gender, island and species 

In [None]:
fig, axs = plt.subplots(3, 1, figsize=(12, 15))
sns.swarmplot(data=penguins_complete, x='bill_length_mm', hue='sex', ax=axs[0])
sns.swarmplot(data=penguins_complete, x='bill_length_mm', hue='island', ax=axs[1])
sns.swarmplot(data=penguins_complete, x='bill_length_mm', hue='species', ax=axs[2])

axs[0].set_title('Bill Length by Sex')
axs[1].set_title('Bill Length by Island')
axs[2].set_title('Bill Length by Species')

plt.subplots_adjust(hspace = 0.5)

plt.show()

## Bill Depth by gender, island and species 

In [None]:
fig, axs = plt.subplots(3, 1, figsize=(10, 15))
sns.histplot(data=penguins_complete, x='bill_depth_mm', hue='sex', multiple = 'stack', ax=axs[0])
sns.histplot(data=penguins_complete, x='bill_depth_mm', hue='island', multiple = 'stack', ax=axs[1])
sns.histplot(data=penguins_complete, x='bill_depth_mm', hue='species', multiple = 'stack', ax=axs[2])

axs[0].set_title('Bill Depth by Sex')
axs[1].set_title('Bill Depth by Island')
axs[2].set_title('Bill Depth by Species')

plt.subplots_adjust(hspace = 0.5)

#plt.tight_layout()
plt.show()

## Flipper length by gender, island and species

In [None]:
fig, axs = plt.subplots(3, 1, figsize=(10, 15))
colors = ('red', 'blue', 'pink')
sns.histplot(data=penguins_complete, x='flipper_length_mm', hue='sex', multiple="dodge", shrink=.8, ax=axs[0])
sns.histplot(data=penguins_complete, x='flipper_length_mm', hue='island', multiple="dodge", shrink=.8, ax=axs[1])
sns.histplot(data=penguins_complete, x='flipper_length_mm', hue='species', multiple="dodge", shrink=.8, palette = colors, ax=axs[2])
# multiple = dodge sets the bars side by side in groups

axs[0].set_title('Flipper Length by Sex')
axs[1].set_title('Flipper Length by Island')
axs[2].set_title('Flipper Length by Species')

plt.subplots_adjust(hspace = 0.5)

plt.show()

## Body mass by gender, island and species

In [None]:
fig, axs = plt.subplots(3, 1, figsize=(10, 15))
sns.histplot(data=penguins_complete, x='body_mass_g', hue='sex', multiple = 'stack', ax=axs[0])
sns.histplot(data=penguins_complete, x='body_mass_g', hue='island', multiple = 'stack', ax=axs[1])
sns.histplot(data=penguins_complete, x='body_mass_g', hue='species', multiple = 'stack', ax=axs[2])

axs[0].set_title('Body Mass by Sex')
axs[1].set_title('Body Mass by Island')
axs[2].set_title('Body Mass by Species')

plt.subplots_adjust(hspace = 0.5)

#plt.tight_layout()
plt.show()

In [None]:
sns.pairplot(penguins_complete, hue="species")