# Fundamentals of Data Analysis Tasks

__Rebecca Hannah Quinn__

### Task 1: Collatz Conjecture

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always eventually reach 1 (after which the sequence cycles through 4, 2, 1).

__This task is to verify, using Python, that the conjecture is true for the first 10,000 positive integers.__

![Collatz Conjecture Equation](https://bpb-us-e1.wpmucdn.com/sites.dartmouth.edu/dist/4/417/files/2019/11/gyorda_article_1_picture.png)

In [74]:
# defines the basic function of the Collatz conjecture
def f(x):
    # if the integer is divisible by 2 (is even), halve it and return the result
    if x % 2 == 0:
        return x // 2
    # otherwise the integer is odd, so multiply by 3 and add 1
    else:
        return (3 * x) + 1

In [75]:
# function to print the results and to loop the function
def collatz(x):
    # prints a formatted statement with the starting integer to verify
    print(f'Testing Collatz with the initial value of {x}')
    # repeatedly applies f as long as the resulting integer does not equal 1
    while x != 1:
        x = f(x)
        # prints each result on the same line, separated by commas
        print(x, end=', ')

In [76]:
# Calls the function above with the integer we want to verify.
# collatz() prints its own output and returns nothing, so it is not wrapped in print()
collatz(10000)

Testing Collatz with the initial value of 10000
5000, 2500, 1250, 625, 1876, 938, 469, 1408, 704, 352, 176, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1, 
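
The cell above follows a single starting value, while the task asks about the first 10,000 positive integers. One possible way to cover the whole range is sketched below. The names `collatz_steps` and `max_steps` are hypothetical and not part of the original notebook, and the step cap is an assumed safeguard so the loop cannot run forever if a value never reached 1.

```python
def collatz_steps(x, max_steps=10_000):
    """Return how many applications of f are needed to reach 1,
    or None if max_steps (an assumed safety cap) is exceeded."""
    steps = 0
    while x != 1:
        if steps >= max_steps:
            return None
        # same rule as f(x): halve even numbers, otherwise 3x + 1
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        steps += 1
    return steps


# collect every starting value that fails to reach 1 within the cap
failures = [n for n in range(1, 10_001) if collatz_steps(n) is None]

if failures:
    print(f"Did not reach 1 within the step cap: {failures}")
else:
    print("Every integer from 1 to 10,000 reaches 1.")
```

Counting steps instead of printing every sequence keeps the output to one line while still checking each starting value.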


---

### Task 2: Penguins Data Set

> An overview of the famous penguins data set, explaining the types of variables it contains and suggestions of variables that should be used to model them in Python and why.



The "penguins" dataset refers to the Palmer Penguins dataset, collected between 2007 and 2009 in the Palmer Archipelago in Antarctica. It records measurements for three species of penguin: Chinstrap, Gentoo and Adélie. Each row lists a penguin's bill length and depth, flipper length and body mass, along with the categories island, species and sex. Investigating these measurements can help us to understand the characteristics and behaviours of the three species.

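The data types are strings and numerical: `species`, `island` and `sex` are strings (categorical values), while `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm` and `body_mass_g` are numerical (floating-point) measurements.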
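
As a rough illustration of how these variable types could be modelled in plain Python, the sketch below uses enumerations for the categorical values and floats for the measurements. The class names `Species`, `Sex` and `Penguin` are hypothetical, and `Optional` is an assumption made to allow for the missing values (`NaN`) that appear in the data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Species(Enum):
    # the three species named in the overview
    ADELIE = "Adelie"
    CHINSTRAP = "Chinstrap"
    GENTOO = "Gentoo"


class Sex(Enum):
    MALE = "MALE"
    FEMALE = "FEMALE"


@dataclass
class Penguin:
    species: Species                     # categorical, fixed set of values
    island: str                          # categorical, kept as a plain string here
    bill_length_mm: Optional[float]      # continuous measurement, may be missing
    bill_depth_mm: Optional[float]
    flipper_length_mm: Optional[float]
    body_mass_g: Optional[float]
    sex: Optional[Sex]                   # categorical with two values, may be missing


# the first row of the dataset, and the row with missing measurements
first = Penguin(Species.ADELIE, "Torgersen", 39.1, 18.7, 181.0, 3750.0, Sex.MALE)
missing = Penguin(Species.ADELIE, "Torgersen", None, None, None, None, None)
print(first)
print(missing)
```

When working in pandas, the same idea roughly corresponds to a `category` dtype for the categorical columns and `float64` for the measurements.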

| **Chinstrap** | **Gentoo** | **Adélie** |
| ------------- | ---------- | ---------- |
| ![Chinstrap Penguin](https://github.com/rebhanqui/data_funds/blob/main/Tasks/Images/chinstrappenguin.jpeg?raw=true) | ![Gentoo Penguin](https://github.com/rebhanqui/data_funds/blob/main/Tasks/Images/gentoopenguin.png?raw=true) | ![Adélie Penguin](https://github.com/rebhanqui/data_funds/blob/main/Tasks/Images/adeliepenguin.jpeg?raw=true) |
| | | NOOT NOOT |


---

### Task 3: Variables and Random Distributions 

> Suggest which probability distribution from NumPy random distributions list that is most appropriate to model each of the variables in the "penguins" dataset.

#### Variables and Their Random Probability Distributions

| Variable | Random Probability Distribution Type | Reasoning |
| -------- | ------------------------------- | --------- |
| species | multinomial distribution | as there are more than 2 possible species, the multinomial distribution applied to this categorical variable allows us to investigate the distribution and frequency of the species |
| island | multinomial distribution | as there are more than 2 islands named, the multinomial distribution applied to this categorical variable allows us to investigate the distribution and frequency of the island habitats |
| sex | binomial distribution | as the variable is categorical but has only two outcomes: male or female |
| bill_length_mm | normal distribution | as it is a continuous numerical measurement |
| bill_depth_mm | normal distribution | as it is a continuous numerical measurement |
| flipper_length_mm | normal distribution | as it is a continuous numerical measurement |
| body_mass_g | normal distribution | as it is a continuous numerical measurement |

[^1] [^2] [^3]
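
To make the table concrete, the sketch below draws samples from the three suggested NumPy distributions. The sample size of 344 matches the number of rows in the dataset. All probabilities, the mean and the standard deviation are made-up illustrative values, not estimates from the data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_penguins = 344  # number of rows in the penguins dataset

# species / island: multinomial over three categories
# ASSUMPTION: illustrative category probabilities, not taken from the data
category_probs = [0.4, 0.2, 0.4]
species_counts = rng.multinomial(n_penguins, category_probs)

# sex: one two-outcome trial per penguin (binomial with n=1)
# ASSUMPTION: both outcomes equally likely; 0 and 1 stand for the two categories
sex_draws = rng.binomial(n=1, p=0.5, size=n_penguins)

# body_mass_g (and the other measurements): normal distribution
# ASSUMPTION: illustrative mean and standard deviation in grams
body_mass_g = rng.normal(loc=4000.0, scale=800.0, size=n_penguins)

print("penguins per category:", species_counts)
print("first ten sex draws:", sex_draws[:10])
print("first five body masses:", body_mass_g[:5].round(1))
```

In practice the parameters would be estimated from the dataset itself, for example from the category frequencies and from the mean and standard deviation of each measurement.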

---


### Task 4: Plot the Entropy

> Suppose you are flipping two coins, each with a probability of **p** of giving heads. Plot the entropy of the total number of heads versus **p**.

Each coin flip has only two possible results, so a single flip follows a Bernoulli distribution. The total number of heads from two flips is therefore binomial with $n = 2$, taking the values 0, 1 or 2.
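
As a worked formula, write $X$ for the total number of heads. This is a sketch that assumes the two flips are independent and that entropy is measured in bits, neither of which the task states explicitly.

$$P(X=0) = (1-p)^2, \qquad P(X=1) = 2p(1-p), \qquad P(X=2) = p^2$$

$$H(X) = -\sum_{k=0}^{2} P(X=k)\,\log_2 P(X=k)$$

For example, at $p = 0.5$ the probabilities are 0.25, 0.5 and 0.25, giving $H(X) = 1.5$ bits. At $p = 0$ or $p = 1$ the outcome is certain, so $H(X) = 0$ (using the convention $0 \log_2 0 = 0$).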

In [77]:
# importing libraries for use in tasks/projects - numpy for numerical arrays
import numpy as np
#REF: https://atlantictu-my.sharepoint.com/:v:/r/personal/ian_mcloughlin_atu_ie/Documents/student_shares/fundamentals_of_data_analysis/4_information/fund_t04v02_np_binomial_play.mkv?csf=1&web=1&e=Qd5qil&nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJTdHJlYW1XZWJBcHAiLCJyZWZlcnJhbFZpZXciOiJTaGFyZURpYWxvZyIsInJlZmVycmFsQXBwUGxhdGZvcm0iOiJXZWIiLCJyZWZlcnJhbE1vZGUiOiJ2aWV3In19

# matplotlib to plot the entropy of the total number of heads versus p
import matplotlib.pyplot as plt

# scipy.stats for the binomial probabilities and the entropy calculation
#REF: https://atlantictu-my.sharepoint.com/:v:/r/personal/ian_mcloughlin_atu_ie/Documents/student_shares/fundamentals_of_data_analysis/4_information/fund_t04v03_bernoulli.mkv?csf=1&web=1&e=GW7Xza&nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJTdHJlYW1XZWJBcHAiLCJyZWZlcnJhbFZpZXciOiJTaGFyZURpYWxvZyIsInJlZmVycmFsQXBwUGxhdGZvcm0iOiJXZWIiLCJyZWZlcnJhbE1vZGUiOiJ2aWV3In19
from scipy.stats import binom, entropy

# entropy (in bits) of the total number of heads in 2 coins tossed;
# named heads_entropy so it does not overwrite the imported entropy function
def heads_entropy(p):
    # probabilities of getting 0, 1 or 2 heads
    probs = binom.pmf([0, 1, 2], 2, p)
    return entropy(probs, base=2)

# evaluates the entropy for values of p from 0 to 1 and plots the result
p_values = np.linspace(0.0, 1.0, 101)
entropies = [heads_entropy(p) for p in p_values]

plt.plot(p_values, entropies)
plt.xlabel("p (probability of heads)")
plt.ylabel("Entropy of the number of heads (bits)")
plt.show()

### Task 5: Plotting the Penguin Data Set

> Create appropriate individual plots for each variable in the Penguin Data Set

#### Summarizing The Penguin Dataset

To get the basic information about the data, we summarize it with `head()` for a quick look at the data and the variables we need to work with, and `shape` to see how much data we are working with. We also check how many values there are of particular variables with `value_counts()` for `sex`, `species`, and `island`. Then we check for null values throughout and write a summary of the data to a file to compare the results of our plots with.

In [78]:
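# numpy for numerical arrays, pandas for loading and summarizing the data
import numpy as np
import pandas as pd
# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# loads the penguins data from a local CSV file
penguins = pd.read_csv("/Users/rebeccaquinn/Desktop/Data Analysis 22-24/S2-23/data_funds/Tasks/Relevant Files/penguins.csv")
# shows the first five rows
print(penguins.head())

# shape is an attribute (not a method) that gives (rows, columns)
print("The penguin dataset contains (Rows, Columns):")
penguins.shape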

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    MALE  
1       3800.0  FEMALE  
2       3250.0  FEMALE  
3          NaN     NaN  
4       3450.0  FEMALE  
The penguin dataset contains (Rows, Columns):


(344, 7)

In [None]:
# checks the count of each species
print(penguins.value_counts("species"))
# checks the count of each island
print(penguins.value_counts("island"))
# checks the count of each sex
print(penguins.value_counts("sex"))

In [None]:
#checks for null values
print(penguins.isnull().sum())

In [None]:
# summarizes the numerical variables so the summary can be written to a new file
summarize = penguins.describe()
# outputs the summary to a single text file;
# the with block closes the file automatically, so no explicit close() is needed
with open("summary.txt", "w") as file:
    file.write(summarize.to_string())

#### Scatterplot

In [112]:
import seaborn as sns

plt.rcParams["figure.figsize"] = [12.00, 12.00]
fig, axes = plt.subplots(2, 2)

# adjusts the spacing between the subplots
fig.subplots_adjust(hspace=0.50, wspace=0.25)

# plots bill depth against bill length, coloured by each categorical variable in turn
sns.scatterplot(data=penguins, x="bill_depth_mm", y="bill_length_mm", hue="species", ax=axes[0,0])
sns.scatterplot(data=penguins, x="bill_depth_mm", y="bill_length_mm", hue="island", ax=axes[0,1])
sns.scatterplot(data=penguins, x="bill_depth_mm", y="bill_length_mm", hue="sex", ax=axes[1,0])

# hides the unused fourth subplot
axes[1,1].set_visible(False)

# if saving, call savefig before show(), otherwise the saved figure may be blank
#plt.savefig("./Images/scatterplot_variables.jpeg")
plt.show()

plt.close()
#REF https://www.statology.org/seaborn-title/
#REF https://www.tutorialspoint.com/how-to-adjust-the-space-between-matplotlib-seaborn-subplots-for-multi-plot-layouts

---

## References:

[^1]: https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html
[^2]: https://numpy.org/doc/stable/reference/random/generated/numpy.random.binomial.html
[^3]: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
Markdown cheat sheet: https://www.markdownguide.org/cheat-sheet/
Images issue: https://stackoverflow.com/questions/13051428/how-to-display-images-in-markdown-files-on-github

---

## End