# Fundamentals of Data Analysis Tasks

__Rebecca Hannah Quinn__

### Task 1: Collatz Conjecture

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence $1, 4, 2, 1, 4, 2, \ldots$

__This task is to verify, using Python, that the conjecture is true for the first 10,000 positive integers.__

![Collatz Conjecture Equation](https://bpb-us-e1.wpmucdn.com/sites.dartmouth.edu/dist/4/417/files/2019/11/gyorda_article_1_picture.png)

In [None]:
# Defines the basic function of the Collatz conjecture
def f(x):
    # If the integer is divisible by 2 (is even), divide by two and return the answer
    if x % 2 == 0:
        return x // 2
    # Otherwise the integer is odd, so multiply by 3 and add 1
    else:
        return (3 * x) + 1

In [None]:
# Function to loop f(x) and print the results
def collatz(x):
    # Prints a formatted statement with the current integer to verify
    print(f'Testing Collatz with the initial value of {x}')
    # Loops as long as the integer returned by f(x) does not equal 1
    while x != 1:
        x = f(x)
        # Prints each result on one line, separated by commas
        print(x, end=', ')

In [None]:
# Calls the function with the integer we want to verify
# (collatz prints its own output and returns nothing, so it is not wrapped in print)
collatz(10000)
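The call above traces a single starting value. To cover the first 10,000 positive integers named in the task, a minimal sketch along the following lines could be used. The helper name `reaches_one` and the `max_steps` safety cap are assumptions added for illustration, and `f` is repeated so the cell runs on its own.

```python
def f(x):
    """Apply one step of the Collatz function."""
    return x // 2 if x % 2 == 0 else (3 * x) + 1


def reaches_one(x, max_steps=10_000):
    """Return True if x reaches 1 within max_steps applications of f.

    max_steps is an assumed safety cap so the loop cannot run forever.
    """
    steps = 0
    while x != 1:
        if steps >= max_steps:
            return False
        x = f(x)
        steps += 1
    return True


# Collect every starting value from 1 to 10,000 that fails to reach 1
failures = [n for n in range(1, 10_001) if not reaches_one(n)]
print(f"Values that did not reach 1: {failures}")
```

An empty list means every starting value in the range ended at 1 and therefore falls into the $1, 4, 2$ cycle.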

---

### Task 2: Penguins Data Set

> An overview of the famous penguins data set, explaining the types of variables it contains and suggestions of variables that should be used to model them in Python and why.



The dataset known as the "penguins" dataset is the Palmer Penguins dataset, collected between 2007 and 2009 in the Palmer Archipelago in Antarctica. It records measurements for three species of penguin: Chinstrap, Gentoo and Adélie. The variables are the bill length and depth, flipper length and body mass, together with the island, species name and sex categories. Investigating these measurements can help us to understand the characteristics and behaviours of the three species.

The data types are strings (species, island and sex) and numerical values (bill length, bill depth, flipper length and body mass).
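As a rough sketch of how this split can be checked with pandas, the example below builds a two-row frame with the same column names and separates the columns by type. The rows are made-up values for illustration only, not real measurements, and the variable names `sample`, `categorical` and `numerical` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT real measurements
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "bill_length_mm": [39.0, 46.0],
    "bill_depth_mm": [18.0, 14.0],
    "flipper_length_mm": [180.0, 210.0],
    "body_mass_g": [3700.0, 5000.0],
    "sex": ["Male", "Female"],
})

# Columns holding strings (categorical variables)
categorical = sample.select_dtypes(exclude="number").columns.tolist()
# Columns holding numbers (continuous measurements)
numerical = sample.select_dtypes(include="number").columns.tolist()

print("Categorical:", categorical)
print("Numerical:", numerical)
```

The same two `select_dtypes` calls can be run on the full data frame once the CSV file has been read in.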

| **Chinstrap** | **Gentoo** | **Adélie** |
| ------------- | ---------- | ---------- |
| ![Chinstrap Penguin](/Tasks/Images/chinstrappenguin.jpeg) | ![Gentoo Penguin](/Tasks/Images/gentoopenguin.png) | ![Adélie Penguin](/Tasks/Images/adeliepenguin.jpeg) |
| | | NOOT NOOT |


---

### Task 3: Variables and Random Distributions 

> Suggest which probability distribution from NumPy random distributions list that is most appropriate to model each of the variables in the "penguins" dataset.



#### Data Information

> To better understand the data we are dealing with, the cells below display information about it that only the computer can confirm.

In [None]:
import numpy as np  
import pandas as pd

# Read in the CSV file - acquired from:
df = pd.read_csv("penguins.csv")
# Print the top 5 rows of the data
print(df.head())


In [None]:

#checks how many rows and columns
print(df.shape)
#Result is 344 rows and 7 columns (does not include the row numbers column)


In [None]:

#checks the count of each species and island (both follow a multinomial distribution)
print(df.value_counts("species"))


In [None]:

print(df.value_counts("island"))



In [None]:

#checks for null values
print(df.isnull().sum())


In [None]:

# Summarise the information in order to write it to a new file

summarize = df.describe()
# Outputs a summary of each variable to a single text file for later reference and use in tasks and projects
# (the with block closes the file automatically, so no explicit close() is needed)
with open("summary.txt", "w") as file:
    file.write(summarize.to_string())

![summary.txt text file results with various values include mean, count and standard deviation](/Tasks/Images/summarytxtfile.png)



#### Variables and Their Random Probability Distributions

| Variable | Random Probability Distribution | Reasoning |
| -------- | ------------------------------- | --------- |
| species | multinomial distribution | as there are more than two possible species, the multinomial distribution on this categorical variable allows us to investigate the frequency of each species |
| island | multinomial distribution | as there are more than two islands named, the multinomial distribution on this categorical variable allows us to investigate the frequency of observations on each island |
| sex | binomial distribution | as the variable is categorical but has only two outcomes: male or female |
| bill_length_mm | normal distribution | as it is a continuous numerical measurement |
| bill_depth_mm | normal distribution | as it is a continuous numerical measurement |
| flipper_length_mm | normal distribution | as it is a continuous numerical measurement |
| body_mass_g | normal distribution | as it is a continuous numerical measurement |

[^1] [^2] [^3]
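The suggestions in the table can be sketched with NumPy's random generator as follows. Every parameter here (the equal species probabilities, the 0.5 probability for sex, and the mean and standard deviation for body mass) is a placeholder assumption rather than a value taken from the data; in practice they would be replaced with the counts and summary statistics computed above.

```python
import numpy as np

# Seeded generator so the sketch is repeatable; the seed itself is arbitrary
rng = np.random.default_rng(seed=0)

n = 344  # number of rows reported by df.shape

# species / island: multinomial counts over three categories
# (equal probabilities are a placeholder assumption)
species_p = [1 / 3, 1 / 3, 1 / 3]
species_counts = rng.multinomial(n, species_p)

# sex: binomial with a single trial per penguin, giving 0 or 1
# (p=0.5 is a placeholder assumption)
sex = rng.binomial(1, 0.5, size=n)

# body_mass_g: normal distribution
# (loc and scale are placeholders; use the mean and std from summary.txt)
body_mass = rng.normal(loc=4000.0, scale=800.0, size=n)

print(species_counts)
print(sex[:10])
print(body_mass[:5])
```

The same `rng.normal` call applies to the other three measurements, with `loc` and `scale` swapped for the relevant mean and standard deviation.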

---


In [None]:
# Importing libraries for use in tasks/projects
import numpy as np
# pyplot is the plotting interface conventionally imported as plt
import matplotlib.pyplot as plt
# random is used to generate numerical values for exploring the data set
import random


---

## References:

[^1]: https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html
[^2]: https://numpy.org/doc/stable/reference/random/generated/numpy.random.binomial.html
[^3]: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
https://www.markdownguide.org/cheat-sheet/ 

---

## End