# Fundamentals of Data Analysis Tasks

## Rebecca Feeley

*** 

## Task 1
> To verify, using Python, that the Collatz conjecture is true for the first 10,000 positive integers.

The Collatz Conjecture is a mathematical problem which remains unsolved. It is based on the function below:

\begin{equation*}
f(x) =
\begin{cases}
x \div 2 & \text{if } x \equiv 0 \pmod{2} \text{, i.e. if } x \text{ is even}\\
3x+1 & \text{if } x \equiv 1 \pmod{2} \text{, i.e. if } x \text{ is odd}.
\end{cases}
\end{equation*}

In simpler terms: pick any positive integer. If it is even, divide it by 2; if it is odd, multiply it by 3 and add 1. Repeat this process with the resulting number. The Collatz Conjecture states that, no matter what positive integer is used, the sequence will always eventually reach 1. For example, starting from 6 the sequence is 6, 3, 10, 5, 16, 8, 4, 2, 1.

The task here is to verify that the Collatz Conjecture holds true for the first 10,000 positive integers.


**The first set of code is to demonstrate the Collatz Conjecture itself by asking the user to input a positive integer and demonstrating how the conjecture will eventually reach 1 for this number.**

In [None]:
# Asks the user to input a positive integer and outputs the successive values of the following calculation:
# if even, divide by 2; if odd, multiply by 3 and add 1

number = int(input("Enter any positive integer: ")) # asks the user to input a number and converts it to an integer
collatz_numbers = [] # a list is created to which each number in the sequence will be added

while number <= 0: # this loop catches zero and negative numbers so that the program runs properly
    print("The number inputted must be a positive integer.")
    number = int(input("Please enter a positive integer: "))

def collatz(number): # I created a function to carry out one step of the Collatz calculation
    if (number % 2) == 0: # modulus is used to check whether the number is even
        number = number // 2 # if it is even, the number is divided by 2
    else: # otherwise the number is odd,
        number = number * 3 + 1 # so it is multiplied by 3 and 1 is added
    return number

while number != 1: # this while loop runs until the number equals 1
    collatz_numbers.append(number) # the current number is appended to the collatz_numbers list
    number = collatz(number) # the next number in the sequence is calculated
collatz_numbers.append(number) # the final 1 is appended to complete the sequence
print(*collatz_numbers) # the '*' unpacks the collatz_numbers list into separate arguments and prints them out



### The above code demonstrates the Collatz Conjecture in action to the user. Now, I will verify the Collatz Conjecture for the first 10,000 positive integers. 

In [None]:
def collatz(number): # I created a function to carry out one step of the Collatz calculation
    if (number % 2) == 0: # modulus is used to check whether the number is even
        number = number // 2 # if it is even, the number is divided by 2
    else: # otherwise the number is odd,
        number = number * 3 + 1 # so it is multiplied by 3 and 1 is added
    return number

collatz_dict = {} # here I am creating a dictionary to store the sequence for each starting number
for i in range(1, 10001):
    number = i
    collatz_sequence = []
    while number != 1: # this while loop runs until the number equals 1
        collatz_sequence.append(number)
        number = collatz(number)
    collatz_sequence.append(number) # the final 1 is appended to complete the sequence
    collatz_dict[i] = collatz_sequence

for key, value in collatz_dict.items():
    print(f"Starting Number: {key}, Sequence: {value}") # shows all of the values down to 1, verifying the conjecture for these 10,000 starting numbers
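Printing 10,000 sequences makes it hard to see at a glance whether every one of them really ended in 1. As a minimal sketch of a more compact check, the code below collects any starting number that fails to reach 1. The helper name `collatz_reaches_one` and the `max_steps` safety limit are my own assumptions for illustration and are not part of the task; the limit only guarantees that the loop stops even if a sequence never reached 1.

```python
def collatz_reaches_one(start, max_steps=10_000):
    """Return True if the Collatz sequence from `start` reaches 1 within max_steps steps."""
    number = start
    for _ in range(max_steps):  # max_steps is an assumed safety limit, not part of the conjecture
        if number == 1:
            return True
        # one Collatz step: halve even numbers, otherwise multiply by 3 and add 1
        number = number // 2 if number % 2 == 0 else 3 * number + 1
    return number == 1

# Collect every starting number from 1 to 10,000 that did not reach 1
failures = [n for n in range(1, 10001) if not collatz_reaches_one(n)]
print(f"Starting numbers that did not reach 1: {failures}")  # an empty list means all of them reached 1
```

An empty list verifies the conjecture for these 10,000 starting numbers only; it does not prove it for every positive integer.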

The End of Task 1

***

## Task 2

> Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale. 

![Penguins](https://www.gabemednick.com/post/penguin/featured.png) 


The Palmer Penguins Data Set is a data set which has been provided by Dr Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research (LTER) Program. It is often presented as an alternative to the famous Iris Data Set.
Dr Gorman and the LTER Program studied 344 adult penguins of 3 species: Adélie, Chinstrap and Gentoo penguins.
These 344 penguins were studied across three islands in the Palmer Archipelago near Palmer Station, Antarctica.
Dr Gorman and the LTER program collected various types of information on the penguins, which we will see below.


### _Explaining the different types of data_

Data can be separated into 2 primary classes: quantitative data and qualitative data. 

- Qualitative data is also known as categorical data. Categorical data is data which cannot be easily counted or measured using numbers. It holds no numerical value through which it can be ranked. Rather, the variables can only be sorted by category. Examples include car brands, gender, and hair colour.

- Quantitative data is also known as numerical data. This is data which can be quantified, i.e. it has an intrinsic numerical value through which it can be measured. Numerical data can therefore be expressed in numbers and ranked.
Examples include height, speed and temperature.

The above 2 classifications can be further separated into 4 sub-types of data: nominal, ordinal, discrete and continuous data.

Nominal data is a subset of qualitative data. It contains variables which do not have an intrinsic value through which they can be ranked or ordered. They are not quantifiable and cannot be measured numerically.
For example, hair colour does not naturally have a numerical value through which it can be ranked.

Ordinal data is also a subset of qualitative data. It contains variables whose categories fall into an order (or sequence), i.e. they have a natural ordering on a scale but still remain categories rather than numbers. This differentiates it from nominal data.
However, ordinal data does not have a numerical value, so while each category can be assigned a number to show its position, that number cannot be used to do arithmetic. An example is the top three finishers in a marathon.

Discrete data is a subset of quantitative data. It contains numerical variables which can only take whole-number (integer) values, so the number of possible values is limited. These values cannot be further divided into smaller parts.
A clear example of this is the days of the month.

Continuous data is also a subset of quantitative data. It contains variables which can be broken down into ever smaller values. It does not have to be only integers, as is the case with discrete data. An example is the weight of humans, which can be broken down to its fractional value.
Continuous data can be further subdivided into interval and ratio data.
Interval data is data which can be categorised, ordered and evenly spaced (even intervals) but does not have a natural zero point.
Ratio data is data which can be categorised, ordered and evenly spaced, and which also has a natural zero point.
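
To make the four sub-types more concrete, here is a small sketch in plain Python. All of the values (hair colours, runners, day and weight) are made up purely for illustration, and the variable names are my own.

```python
from collections import Counter

# Nominal: categories with no natural order, so they can only be counted or grouped.
hair_colours = ["brown", "black", "red", "brown"]
colour_counts = Counter(hair_colours)
print(colour_counts["brown"])  # 2

# Ordinal: categories with a natural order, but no meaningful arithmetic.
placing_order = ["first", "second", "third"]
finishers = {"Runner A": "second", "Runner B": "first", "Runner C": "third"}
ranked = sorted(finishers, key=lambda name: placing_order.index(finishers[name]))
print(ranked)  # ['Runner B', 'Runner A', 'Runner C']

# Discrete: numerical values restricted to whole numbers.
day_of_month = 17
print(isinstance(day_of_month, int))  # True

# Continuous: numerical values which can take fractional values.
weight_kg = 72.4
print(weight_kg / 2)  # 36.2
```

The ordinal example can be sorted because an order is defined for its categories, whereas the nominal example can only be counted.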

### _Description of the columns/data in the data set_ 
The Palmer Penguins data set contains 344 rows and 7 columns. The variables in the columns of the data set are as follows:

Species: The species refers to one of three penguin species studied (Adelie, Gentoo or Chinstrap).   
Island: This refers to the island upon which the penguin was observed (Biscoe, Dream or Torgersen).   
Bill Length (in mm): The length of the penguin's bill in millimeters.  
Bill Depth (in mm): The depth of the penguin's bill in millimeters.  
Flipper Length (in mm): The length of the penguin's flipper in millimeters.  
Body Mass (in grams): The body mass of the penguins in grams.  
Sex: This refers to the sex of the penguin (either male or female).  

### _Classifying each of the variables in the Penguins Data Set_

The variables in the data set can be separated into two primary classifications: categorical (or qualitative) and numerical (or quantitative).

The variables Species, Island and Sex are all categorical nominal data types. In all three of these variables, there is no intrinsic numerical value or method of ordering them. For example, the species Adelie, Gentoo and Chinstrap cannot be ordered as there is no value through which to order them.

The variables Bill Length, Bill Depth, Flipper Length and Body Mass are all numerical continuous data types. These variables are not restricted to integer values and can take any value, including fractional values.

Furthermore, all of these continuous variables have a natural zero point, as none of these measurements can be less than 0, so they are also ratio data.


In [None]:
import pandas as pd
import numpy as np

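Penguins = pd.read_csv('penguins.csv') # loads the data set from a local CSV file
print(Penguins.describe()) # summary statistics for the numerical columns

Penguins.info() # column names, non-null counts and data types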

The output of `Penguins.info()` above shows that pandas stores the 3 categorical variables as objects and the 4 numerical variables as floats.

### _What variable types should be used to model these variables in Python_

I believe that it is best for the species and island variables to be stored in Python as strings, as these data types cannot be used to complete arithmetic. While some may propose assigning numerical values to each category, such as 1 for Biscoe and 2 for Dream, I would not recommend this approach as these assigned numerical values may get confused with actual numerical data with which arithmetic can be performed. It also increases the likelihood of an error in the Python code, as the values assigned to each category must be consistent throughout the code. The Sex variable differs slightly in that there are only 2 possibilities (i.e. it is dichotomous), so it could be stored as a bool data type. However, I think that it is better to keep a more uniform approach to the nominal data and so it should be stored as a string. This will help prevent errors in the code.

The Bill Length, Bill Depth, Flipper Length and Body Mass variables should be stored as floats in Python as they are continuous quantitative data, so their measurements may include fractional values (for example, fractions of a millimeter). While floating point numbers can introduce small rounding errors, it is important that these variables are stored as floats to allow arithmetic and analysis to be performed on them.

In summary, while the Sex variable could be stored as a bool data type in Python, I think that the more appropriate choice is to store it as a string like the other nominal data.
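
As a minimal sketch of these choices, the code below models a single penguin record with a Python dataclass. The class name `Penguin`, the field names and the example values are my own assumptions for illustration; the values are not a row copied from the data set. Sex is kept as a string here, with a bool being the alternative discussed above. The last lines show the kind of small rounding error that floats can introduce.

```python
import math
from dataclasses import dataclass

@dataclass
class Penguin:
    # Nominal variables are modelled as strings
    species: str
    island: str
    sex: str
    # Continuous (ratio) variables are modelled as floats
    bill_length_mm: float
    bill_depth_mm: float
    flipper_length_mm: float
    body_mass_g: float

# Made-up example values, for illustration only
example = Penguin(
    species="Adelie",
    island="Torgersen",
    sex="female",
    bill_length_mm=39.5,
    bill_depth_mm=18.0,
    flipper_length_mm=190.0,
    body_mass_g=3700.0,
)

# Arithmetic is meaningful for the float fields
body_mass_kg = example.body_mass_g / 1000
print(body_mass_kg)  # 3.7

# Floats can carry small rounding errors, so exact comparisons may fail
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```

Comparing floats with `math.isclose` rather than `==` is one way to avoid being caught out by these rounding errors.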

The End of Task 2 

***

## Task 3

> For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable. 

- Explain each of the different probability distributions in Python

- Histograms etc. of each of the variables

- Check the distribution

- Say which one is most appropriate and explain why


# References
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html




In [None]:
# importing all of the relevant libraries 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

The End of Task 3

***

## Task 4

> Suppose you are flipping two coins, each with a probability p of giving heads. Plot the entropy of the total number of heads versus p.


\begin{equation*}
H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)
\end{equation*}
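
The following is a minimal sketch of how this formula could be applied to the two coins. It assumes that the two flips are independent, so the total number of heads is 0, 1 or 2 with probabilities (1 - p)², 2p(1 - p) and p². It also uses the usual convention that a term with probability 0 contributes nothing to the sum. The function names `two_coin_entropy` and `plot_two_coin_entropy` are my own.

```python
import math

def two_coin_entropy(p):
    """Entropy (in bits) of the total number of heads from two independent coins with P(heads) = p."""
    # Probabilities of getting 0, 1 and 2 heads respectively
    probabilities = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    # Terms with probability 0 are skipped, following the convention 0 * log2(0) = 0
    return -sum(q * math.log2(q) for q in probabilities if q > 0)

def plot_two_coin_entropy(points=201):
    """Plot the entropy against p for evenly spaced values of p between 0 and 1."""
    import matplotlib.pyplot as plt  # imported here so the entropy function works without matplotlib
    p_values = [i / (points - 1) for i in range(points)]
    plt.plot(p_values, [two_coin_entropy(p) for p in p_values])
    plt.xlabel("p (probability of heads)")
    plt.ylabel("Entropy of total number of heads (bits)")
    plt.title("Entropy of the number of heads from two coin flips")
    plt.show()

print(two_coin_entropy(0.5))  # 1.5
```

Calling `plot_two_coin_entropy()` draws the curve. The entropy is 0 when p is 0 or 1, because the outcome is then certain, and it is 1.5 bits when p is 0.5.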