#### Randy Baicich                                                                                    

# Capstone Project 1: 

# *Shark Attacks in Coastal Waters*

The dataset used in this project was originally collected from the Global Shark Attack File on the [Shark Research Institute's website](https://www.sharkattackfile.net). It is available for download from [Kaggle](https://www.kaggle.com/c/shark-attack-dataset).

*In this notebook, I use the Shark Attack dataset to perform a comprehensive analysis. The first step is to import the libraries needed for data processing and analysis. Once the dataset is loaded, the next step is to tidy it: removing unnecessary columns, capitalizing column names, and removing extra spaces. After tidying, I transform the data by calculating statistics such as counts and averages and, where useful, creating new features from existing columns. I also attempt to connect to an [ElephantSQL](https://www.elephantsql.com/) instance and load the data into tables there so that it can be queried. The data is then visualized: I create several plots, including scatter plots and bar charts, and use [Tableau](https://public.tableau.com/app/discover) to explore relationships between variables and to look for patterns and distributions in the data. The final step is to communicate the findings by summarizing the key results, presenting the visualizations, and providing interpretations and recommendations.*

# Starting Hypothesis.

#### *Before analyzing the Shark Attack dataset, I hypothesize that a select few shark species are responsible for the majority of shark attacks and fatalities, rather than a wide variety of species. I also expect the majority of shark attacks to occur at specific times, indicating a temporal pattern in shark-human interactions. By exploring the data and conducting statistical analysis, I aim to determine whether these hypotheses hold and to gain insight into the key factors influencing shark attacks and fatalities.*
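
One way to make the first hypothesis checkable is to measure how much of the total the most frequent species account for. The sketch below is a minimal, plain-Python illustration on made-up labels; the function name `top_k_share` and the sample values are assumptions for illustration and are not taken from the dataset or from the analysis that follows.

```python
from collections import Counter

def top_k_share(labels, k):
    """Return the fraction of all labelled records that belong to the k most common labels."""
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return top / total

# Made-up species labels, for illustration only (not taken from the dataset).
sample_species = ["White", "White", "White", "Tiger", "Tiger", "Bull", "Nurse", None]

share = top_k_share(sample_species, 2)
print(f"Top 2 species account for {share:.0%} of labelled records")
```

With the real data, `labels` could be the values of the species column. A share close to 1 for a small `k` would be consistent with the hypothesis, while a share that grows slowly with `k` would argue against it.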

## Part 1: *Import, Clean, and Save Data.*

#### *Import all necessary libraries.*

In [None]:
import psycopg2
import pandas as pd
import matplotlib.pyplot as plt

#### *Import the sourced Shark Attack data.*

In [None]:
# Use a raw string so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'C:\Users\RedneckRandy\Documents\GitHub\Capstone-Project-1\GSAF5.xls.csv')


#### *Clean the CSV file/data.*

In [None]:
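# Strip surrounding whitespace and capitalize the first letter of each column name.
# str.capitalize() would also lower-case the rest ('Fatal (Y/N)' -> 'Fatal (y/n)'),
# so only the first character is upper-cased here.
data.columns = [col.strip()[:1].upper() + col.strip()[1:] for col in data.columns]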

In [None]:
# Remove extra space in column names
data.columns = data.columns.str.replace(' ', '')

In [None]:
# Remove columns starting with "Unnamed"
data = data.loc[:, ~data.columns.str.startswith('Unnamed')]
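
The effect of these cleaning steps on the column names is easiest to see on a few examples. The sketch below applies them to a hypothetical list of header names (assumed for illustration, not read from the file) and compares two ways of capitalizing. Python's `str.capitalize()` lower-cases every character after the first, so the two variants produce different names, which matters for any later cell that looks a column up by name.

```python
# Hypothetical raw header names, chosen only to illustrate the cleaning steps.
raw_columns = ["Case Number", "Fatal (Y/N)", "Species ", "Unnamed: 22"]

def clean_name(name, keep_case=True):
    """Strip whitespace, upper-case the first letter, and remove inner spaces."""
    name = name.strip()
    if keep_case:
        name = name[:1].upper() + name[1:]   # leaves the remaining letters untouched
    else:
        name = name.capitalize()             # also lower-cases the remaining letters
    return name.replace(" ", "")

# Drop the placeholder columns, then clean what is left.
kept = [c for c in raw_columns if not c.startswith("Unnamed")]
preserved = [clean_name(c) for c in kept]
lowered = [clean_name(c, keep_case=False) for c in kept]

print(preserved)  # ['CaseNumber', 'Fatal(Y/N)', 'Species']
print(lowered)    # ['Casenumber', 'Fatal(y/n)', 'Species']
```

Printing `data.columns` after cleaning is a quick way to confirm which of these forms the DataFrame actually contains.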

#### *Save the new cleaned CSV.*

In [None]:
data.to_csv('sharks_sorted.csv', index=False)

## Part 2: *Analysis of the data.*

In [None]:
# Count rows where 'Fatal(Y/N)' is exactly 'Y'
# (str.count would count every 'Y' character, including those inside other values)
total_Y_fatal = (data['Fatal(Y/N)'] == 'Y').sum()

In [None]:
# Count rows where 'Fatal(Y/N)' is exactly 'N'
# (str.count would count every 'N' character, including those inside other values)
total_N_fatal = (data['Fatal(Y/N)'] == 'N').sum()

In [None]:
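# Average 'Age' for fatal attacks; non-numeric ages become NaN and are ignored by mean()
fatal_ages = pd.to_numeric(data.loc[data['Fatal(Y/N)'] == 'Y', 'Age'], errors='coerce')
average_age_Y_fatal = fatal_ages.mean()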

In [None]:
# Total count for each unique value in 'Location' column
location_totals = data['Location'].value_counts()

In [None]:
# Total count for each unique value in 'Species' column
species_totals = data['Species'].value_counts()

In [None]:
# Total count for each unique value in 'Activity' column
activity_totals = data['Activity'].value_counts()

In [None]:
# Total count for each unique value in 'Type' column
type_totals = data['Type'].value_counts()

In [None]:
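# Average hour of "Time". Assumes times are text entries such as '18h00';
# entries that do not match that pattern become NaN and are ignored by mean().
time_hours = pd.to_numeric(data['Time'].str.extract(r'^(\d{1,2})h', expand=False), errors='coerce')
time_average = time_hours.mean()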
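
Because the second hypothesis concerns when attacks occur, the time values have to be turned into numbers before they can be averaged or counted. The sketch below assumes, purely for illustration, that times are stored as text written like `18h00` and that anything else should be treated as missing. The helper name `parse_hour` and the sample values are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: one or two digits for the hour, an 'h', then two digits for the minutes.
TIME_PATTERN = re.compile(r"^\s*(\d{1,2})h(\d{2})\s*$")

def parse_hour(value):
    """Return the hour (0-23) from a string like '18h00', or None if it does not match."""
    if not isinstance(value, str):
        return None
    match = TIME_PATTERN.match(value)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    return hour if hour < 24 and minute < 60 else None

# Made-up values, for illustration only (not taken from the dataset).
sample_times = ["18h00", "07h30", "Afternoon", None, "18h45", "25h00"]
hours = [h for h in map(parse_hour, sample_times) if h is not None]

print(hours)                          # [18, 7, 18]
print(sum(hours) / len(hours))        # mean hour of the entries that could be parsed
print(Counter(hours).most_common(1))  # [(18, 2)]
```

A count per hour is often more informative than a single mean, since attacks clustered in the early morning and the late afternoon would average out to midday.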

## Part 3: *Connect to the database, create tables, and insert data.*

#### *Connect to ElephantSQL*

In [None]:
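import os

# Read the password from an environment variable instead of hard-coding it in the notebook.
# ELEPHANTSQL_PASSWORD is an assumed variable name; set it before running this cell.
conn = psycopg2.connect(dbname='gblqlzwo',
                        user='gblqlzwo',
                        password=os.environ['ELEPHANTSQL_PASSWORD'],
                        host='rajje.db.elephantsql.com')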
cur = conn.cursor()

#### *Create necessary tables.*

In [None]:
# Define the columns for the table
columns = ['index', 'Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity', 'Name',
           'Unnamed: 9', 'Age', 'Injury', 'Fatal (Y/N)', 'Time', 'Species', 'Investigator or Source', 'pdf',
           'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order']

create_table_query = '''
    CREATE TABLE shark_data(
        "index" INT,
        "Case Number" VARCHAR(100),
        "Date" VARCHAR(100),
        "Year" INT,
        "Type" VARCHAR(100),
        "Country" VARCHAR(100),
        "Area" VARCHAR(100),
        "Location" VARCHAR(100),
        "Activity" VARCHAR(100),
        "Name" VARCHAR(100),
        "Unnamed: 9" VARCHAR(100),
        "Age" VARCHAR(100),
        "Injury" VARCHAR(100),
        "Fatal (Y/N)" VARCHAR(100),
        "Time" VARCHAR(100),
        "Species" VARCHAR(100),
        "Investigator or Source" VARCHAR(100),
        "pdf" VARCHAR(100),
        "href formula" VARCHAR(100),
        "href" VARCHAR(100),
        "Case Number.1" VARCHAR(100),
        "Case Number.2" VARCHAR(100),
        "original order" INT
    )
'''
cur.execute(create_table_query)

#### *Insert into the created table.*

In [None]:
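# Fill every table column except "Unnamed: 9", which was dropped during cleaning.
# Assumes each remaining name in `columns` has a counterpart in the cleaned DataFrame.
table_cols = [c for c in columns if not c.startswith('Unnamed')]
# Match each table column to its cleaned DataFrame column, ignoring case and spaces.
lookup = {c.replace(' ', '').lower(): c for c in data.columns}
source_cols = [lookup[c.replace(' ', '').lower()] for c in table_cols]

# Quoted identifiers are case-sensitive in PostgreSQL, so reuse the CREATE TABLE names.
insert_query = 'INSERT INTO shark_data ({}) VALUES ({})'.format(
    ', '.join('"{}"'.format(c) for c in table_cols),
    ', '.join(['%s'] * len(table_cols)))

# Replace NaN with None so that missing values are stored as NULL.
rows = data[source_cols].astype(object).where(data[source_cols].notna(), None)
for row in rows.itertuples(index=False, name=None):
    cur.execute(insert_query, row)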

conn.commit()
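
Once the rows are in the table, counts such as fatalities per species can also be computed in SQL. The sketch below uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the ElephantSQL instance, so that it runs without credentials. The rows are made up, the names `demo_conn` and `demo_cur` are hypothetical, and the query is written in standard SQL that is expected to carry over to PostgreSQL, although that has not been checked against the real instance.

```python
import sqlite3

# In-memory stand-in for the remote database (an assumption, used only for illustration).
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE shark_data ("Species" TEXT, "Fatal (Y/N)" TEXT)')

# Made-up rows, not taken from the dataset.
rows = [("White", "Y"), ("White", "N"), ("White", "Y"), ("Tiger", "N"), ("Bull", "Y")]
demo_cur.executemany('INSERT INTO shark_data ("Species", "Fatal (Y/N)") VALUES (?, ?)', rows)
demo_conn.commit()

# Attacks and fatalities per species, most attacks first.
query = '''
    SELECT "Species",
           COUNT(*) AS attacks,
           SUM(CASE WHEN "Fatal (Y/N)" = 'Y' THEN 1 ELSE 0 END) AS fatalities
    FROM shark_data
    GROUP BY "Species"
    ORDER BY attacks DESC, "Species"
'''
results = demo_cur.execute(query).fetchall()
for species, attacks, fatalities in results:
    print(species, attacks, fatalities)

demo_conn.close()
```

Note that `sqlite3` uses `?` as its parameter placeholder, whereas `psycopg2` uses `%s`; the query text itself would be passed to `cur.execute` unchanged.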

## Part 4: *Visualize the data and communicate your results.*

#### *Visualization 1: Species vs Type of Attack.*

In [None]:
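# Drop rows with missing values; matplotlib's categorical axes expect strings, not NaN
species_type = data[['Species', 'Type']].dropna()
plt.figure(figsize=(10, 6))
plt.scatter(species_type['Species'], species_type['Type'])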
plt.xlabel('Species')
plt.ylabel('Type')
plt.title('Species vs Type')
plt.xticks(rotation=90)
plt.show()

#### *Visualization 2: Type of Attack vs Activity Being Performed.*

In [None]:
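# Drop rows with missing values; matplotlib's categorical axes expect strings, not NaN
type_activity = data[['Type', 'Activity']].dropna()
plt.figure(figsize=(10, 6))
plt.scatter(type_activity['Type'], type_activity['Activity'])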
plt.xlabel('Type')
plt.ylabel('Activity')
plt.title('Type vs Activity')
plt.xticks(rotation=90)
plt.show()

#### *Visualization 3: Total Fatalities.*

In [None]:
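# Keep only the 'Y' and 'N' counts, in that order, so each bar matches its label
fatal_counts = data['Fatal(Y/N)'].value_counts().reindex(['Y', 'N'], fill_value=0)
plt.figure(figsize=(6, 6))
plt.bar(fatal_counts.index, fatal_counts.values)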
plt.xlabel('Fatal')
plt.ylabel('Count')
plt.title('Fatal (Y/N) Distribution')
plt.show()

#### *Visualization 4: Fatalities by Species.*

In [None]:
species_fatal_counts = data.groupby('Species')['Fatal(Y/N)'].value_counts().unstack().fillna(0)
# DataFrame.plot creates its own figure, so set the size here instead of calling plt.figure first
species_fatal_counts.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Species')
plt.ylabel('Count')
plt.title('Species vs Fatal (Y/N)')
plt.xticks(rotation=90)
plt.legend(title='Fatal (Y/N)')
plt.show()

#### *Close Connections*

In [None]:
cur.close()
conn.close()

## Part 5: *Final Results, Summary, and Conclusion.*