#### Randy Baicich

# Capstone Project 1: 

# *Shark Attacks in Coastal Waters*

The dataset used in this project was originally collected from the Global Shark Attack File on the [Shark Research Institute's website](https://www.sharkattackfile.net). It is available for download from [Kaggle](https://www.kaggle.com/c/shark-attack-dataset).

*In this notebook, I use the Shark Attack dataset to perform a comprehensive analysis. The first step is to import the libraries and modules needed for data processing and analysis. Once the dataset is imported, the next step is to tidy the data by organizing and cleaning it: removing unnecessary columns, capitalizing column names, and removing extra spaces. After tidying, I transform the data by calculating statistics such as counts and averages and by creating new features from existing data. Next comes visualization: I will create several charts, including scatter plots and bar charts, and use [Tableau](https://public.tableau.com/app/discover) to explore relationships between variables and the patterns and distributions in the data. I will also attempt to connect to an [ElephantSQL](https://www.elephantsql.com/) instance to run queries against the created tables. The last step is to communicate the results by summarizing the key findings, presenting visualizations, and providing interpretations and recommendations.*

# Part 1: *Import, Clean, and Save Data.*

#### *Import all necessary libraries.*

In [None]:
import psycopg2
import pandas as pd
import matplotlib.pyplot as plt

#### *Import the sourced Shark Attack data.*

In [None]:
# Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'C:\Users\RedneckRandy\Documents\GitHub\Capstone-Project-1\GSAF5.xls.csv')
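
In a normal Python string, backslashes start escape sequences, and `\U` in particular is a syntax error, so Windows paths are safer as raw strings or `pathlib.Path` objects. The sketch below is one possible way to load the file; the relative location `DATA_FILE` and the helper name `load_attacks` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: assumes the CSV sits in the notebook's working directory.
DATA_FILE = Path("GSAF5.xls.csv")


def load_attacks(path: Path) -> pd.DataFrame:
    """Read the CSV, failing early with a clear message if the file is missing."""
    if not path.exists():
        raise FileNotFoundError(f"Expected the dataset at {path.resolve()}")
    return pd.read_csv(path)
```

With the file in place, `data = load_attacks(DATA_FILE)` would take the place of the hard-coded absolute path, which keeps the notebook portable across machines.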


#### *Clean the CSV file/data.*

In [None]:
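# Strip whitespace and uppercase the first letter of each column name.
# str.capitalize() is avoided because it lowercases the rest, e.g. 'Fatal (Y/N)' -> 'Fatal (y/n)'.
data.columns = [name[:1].upper() + name[1:] for name in data.columns.str.strip()]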

In [None]:
# Remove all spaces from column names, e.g. 'Fatal (Y/N)' -> 'Fatal(Y/N)'
data.columns = data.columns.str.replace(' ', '')

In [None]:
# Remove columns starting with "Unnamed"
data = data.loc[:, ~data.columns.str.startswith('Unnamed')]
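
To see what the three cleaning steps do, they can be tried on a tiny empty frame. This is only a sketch: the column names in `sample` are made up to mimic the problems described above, and `clean_columns` is a hypothetical helper. It uppercases only the first character rather than calling `str.capitalize()`, because `capitalize()` also lowercases every remaining character.

```python
import pandas as pd

# Hypothetical column names, chosen to mimic messy headers.
sample = pd.DataFrame(columns=["case Number", "Fatal (Y/N)", "Species ", "Unnamed: 22"])


def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with tidied column names and no 'Unnamed' columns."""
    out = df.copy()
    # Trim surrounding whitespace, then uppercase only the first character.
    names = [name[:1].upper() + name[1:] for name in out.columns.str.strip()]
    # Drop every remaining space inside the names.
    out.columns = [name.replace(" ", "") for name in names]
    # Keep only the columns whose names do not start with 'Unnamed'.
    return out.loc[:, ~out.columns.str.startswith("Unnamed")]


cleaned = clean_columns(sample)
print(list(cleaned.columns))  # ['CaseNumber', 'Fatal(Y/N)', 'Species']
```

Wrapping the steps in one function makes it easy to rerun the cleaning on a fresh download without repeating cells.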

#### *Save the new cleaned CSV.*

In [None]:
data.to_csv('sharks_sorted.csv', index=False)

# Part 2: Analysis of the data.

In [None]:
# Count rows where 'Fatal(Y/N)' is exactly 'Y' (ignoring surrounding whitespace)
total_Y_fatal = (data['Fatal(Y/N)'].str.strip() == 'Y').sum()

In [None]:
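# Count rows where 'Fatal(Y/N)' is exactly 'N'; str.count('N') would also count
# every 'N' inside other values such as 'UNKNOWN'
total_N_fatal = (data['Fatal(Y/N)'].str.strip() == 'N').sum()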
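
The difference between counting letters and counting rows is easiest to see on a small example. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up values for illustration only.
fatal = pd.Series(["Y", "N", "N", "UNKNOWN", " N", None])

# str.count counts every occurrence of the letter, so 'UNKNOWN' adds three 'N's.
letter_count = fatal.str.count("N").sum()

# Comparing after stripping whitespace counts only rows that are exactly 'N'.
exact_count = (fatal.str.strip() == "N").sum()

# value_counts reports all categories at once, including missing values.
summary = fatal.str.strip().value_counts(dropna=False)

print(letter_count, exact_count)
print(summary)
```

Here the letter count comes to 6 while only 3 rows are actually `'N'`, which is why an exact comparison or `value_counts` is the safer choice for a categorical column.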

In [None]:
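# Average 'Age' of fatal attacks; non-numeric ages are coerced to NaN and skipped
fatal_ages = data.loc[data['Fatal(Y/N)'] == 'Y', 'Age']
average_age_Y_fatal = pd.to_numeric(fatal_ages, errors='coerce').mean()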
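
An average is only meaningful if the column is numeric. The sketch below assumes, as an illustration, that some ages are recorded as free text such as `'teen'`. The rows are made up.

```python
import pandas as pd

# Made-up rows; the free-text age is an assumption about how the column might look.
sample = pd.DataFrame({
    "Fatal(Y/N)": ["Y", "Y", "N", "Y"],
    "Age": ["30", "teen", "25", "40"],
})

# Non-numeric entries such as 'teen' become NaN and are skipped by mean().
ages = pd.to_numeric(sample["Age"], errors="coerce")
mean_age_fatal = ages[sample["Fatal(Y/N)"] == "Y"].mean()

print(mean_age_fatal)  # 35.0
```

Coercing keeps the calculation from failing on text values, at the cost of silently dropping those rows, so it is worth checking how many entries become NaN.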

In [None]:
# Total count for each unique value in 'Location' column
location_totals = data['Location'].value_counts()

In [None]:
# Total count for each unique value in 'Species' column
species_totals = data['Species'].value_counts()

In [None]:
# Total count for each unique value in 'Activity' column
activity_totals = data['Activity'].value_counts()

In [None]:
# Total count for each unique value in 'Type' column
type_totals = data['Type'].value_counts()
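
`value_counts` treats values that differ only in case or whitespace as separate categories. A minimal sketch of normalizing text before counting, using made-up activity values:

```python
import pandas as pd

# Made-up values showing case and whitespace variants of the same activity.
activity = pd.Series(["Surfing", "surfing ", "Swimming", "Surfing", None])

# Strip whitespace and apply title case so the variants collapse into one label.
normalized = activity.str.strip().str.title()

# Keep the two most frequent categories; missing values are excluded by default.
top_activities = normalized.value_counts().head(2)

print(top_activities)
```

Whether this kind of normalization is appropriate depends on the column; for free-text fields it usually reduces the number of near-duplicate categories.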

In [None]:
# Average of 'Time'; entries that are not plain numbers are coerced to NaN and skipped
time_average = pd.to_numeric(data['Time'], errors='coerce').mean()
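
Averaging a time-of-day column only works once the entries are numeric. The sketch below assumes times are written like `'18h00'`; that pattern is an assumption about the column's format, and the values are made up.

```python
import pandas as pd

# Made-up values; the 'HHhMM' pattern is an assumption about the column's format.
times = pd.Series(["18h00", "07h30", "Afternoon", None])

# Pull out the hour and minute digits; rows that do not match become NaN.
parts = times.str.extract(r"^\s*(\d{1,2})h(\d{2})\s*$").astype(float)

# Convert to decimal hours so the values can be averaged.
hours = parts[0] + parts[1] / 60
mean_hour = hours.mean()

print(mean_hour)  # 12.75
```

Entries such as `'Afternoon'` are dropped rather than guessed, so the result describes only the rows with an explicit clock time.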