# PROJECT 02 :
-----

### BRIEF: Data analysis of global shark attacks for a business idea
We first examine the Shark Attack dataset to understand its structure and formulate one or more hypotheses about the data.
We hypothesize that shark attacks are more common in certain locations and peak during specific months.
We then define a business case, such as:

* "As a company that sells medical products, I want to identify destinations with high shark attack rates."
* "As a company providing supply transportation services, I want to know when and where shark attacks peak to plan the safe transport of medical supplies to hospitals."

 Throughout the project, we will use Python and the pandas library to apply at least five data cleaning techniques to handle missing values, duplicates, and formatting inconsistencies. After cleaning, we will perform basic exploratory data analysis to validate our hypotheses and extract insights. 

📝 BUSINESS IDEA: 3 Bullet Points

* Problem to Solve: Coastal hospitals and emergency response teams are not always prepared with the right medical supplies during periods of high shark-attack frequency. This business solves the problem by predicting when and where attacks are most likely, so medical supplies can be stocked in advance.

* Business Concept: Use historical shark attack data to create global heatmaps and seasonal risk forecasts. Then provide pharmaceutical products (painkillers, antibiotics, blood bags, emergency kits) and transportation support to hospitals and ambulances near high-risk beaches.

* Data Used to Profit: The business will analyze Country, Date/Month, Gender, Age, Fatality, and Type of Injury to identify high-risk locations, peak attack months, and most common injury types. This allows optimized supply production, targeted sales, and efficient delivery to the areas that need it most.



## 🧱 COLUMNS TO USE FROM ORIGINAL DATASET:
------

Country (global comparison to decide which countries to invest in more)

    * We check the COUNTRY column and make sure every country name is correct
    * We remove rows where COUNTRY is NULL
    * We make sure each country name is capitalized and written the same way, for example:
        United States of America == USA == US
        must all map to one consistent name!
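
The COUNTRY steps above could be sketched roughly as follows. The sample rows and the `ALIASES` table are made up for illustration; the real alias list would have to be built after inspecting the values in the actual `Country` column.

```python
import pandas as pd

# Hypothetical sample; the real values come from the Country column of raw.csv
df = pd.DataFrame(
    {"Country": ["USA", "usa ", "United States of America", "US", None, "australia"]}
)

# Assumed alias table: extend it after inspecting df["Country"].unique()
ALIASES = {"UNITED STATES OF AMERICA": "USA", "US": "USA"}

# Drop rows without a country, then trim whitespace and upper-case the names
clean = df.dropna(subset=["Country"]).copy()
clean["Country"] = clean["Country"].str.strip().str.upper().replace(ALIASES)

# Attacks per country after normalization
print(clean["Country"].value_counts())
```

Upper-casing before the lookup keeps the alias table small, because `usa`, `Usa` and `USA` all collapse to one spelling first.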

DATE ( MONTH + YEAR )

    * We remove rows where DATE is NULL or does not include both a month and a year
    * We split the DATE column into three columns: DAY + MONTH + YEAR
    * We verify that the new YEAR column matches the old YEAR column
        * NEW YEAR COLUMN is the one split from the DATE column
        * OLD YEAR COLUMN is the one already existing in the original sheet
    * Where the new and old YEAR values do not match, we remove those rows so that only accurate years remain
    * We remove the DAY and OLD YEAR columns
    * We keep the MONTH and the matching YEAR
    * We identify the months in which shark attacks happen the most

GENDER ( F or M )

    * We check the unique values, keep only the two values F and M, and delete all rows that have other values
    * We noticed that mostly M get attacked
    * We compute the percentages of male vs. female victims
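
The DATE steps listed above (split the date, compare the new year with the existing `Year` column, keep only matching rows) might look like the sketch below. The sample rows and the `%Y-%m-%d` format are assumptions; the date strings in raw.csv may be formatted differently and need another `format` value.

```python
import pandas as pd

# Hypothetical sample; real Date strings in raw.csv may use other formats
df = pd.DataFrame({
    "Date": ["2024-03-15", "2023-07-02", None, "Summer 2020", "2021-07-09"],
    "Year": [2024, 2023, 2022, 2020, 2019],
})

# Unparseable or missing dates become NaT instead of raising an error
parsed = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

clean = df.assign(Month=parsed.dt.month, NewYear=parsed.dt.year)
# Drop rows whose date had no usable month and year
clean = clean.dropna(subset=["Month", "NewYear"])
# Keep only rows where the year split from Date matches the original Year column
clean = clean[clean["NewYear"] == clean["Year"]]
# Remove the original Date and old Year columns; keep Month and the verified year
clean = (
    clean.drop(columns=["Date", "Year"])
    .rename(columns={"NewYear": "Year"})
    .astype({"Month": int, "Year": int})
)

# Attacks per month, most frequent first
print(clean["Month"].value_counts())
```

Using `errors="coerce"` means free-text entries are dropped by the same `dropna` call that removes missing dates, so no separate filter is needed.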

AGE (victims' age ranges)

    * The majority of victims survive
    * We split the ages into three categories (minors under 18 / adults 18-40 / elders 40+)
    * We compute the percentage of victims in each age range
    * Keep in mind that treatment complications depend on age
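
One possible way to build the three age groups is `pd.cut`, sketched below on made-up values. The text leaves age 40 itself ambiguous; this sketch assumes 40 counts as adult and anything older as elder.

```python
import pandas as pd

# Hypothetical sample; the real Age column may also contain free text
df = pd.DataFrame({"Age": ["12", "18", "40", "41", "teen", None, "67"]})

# Non-numeric entries such as "teen" become NaN
age = pd.to_numeric(df["Age"], errors="coerce")

# Assumed boundaries: under 18, 18 to 40 inclusive, and over 40
bins = [0, 18, 41, float("inf")]
labels = ["minor (<18)", "adult (18-40)", "elder (40+)"]
# right=False makes each bin include its left edge: [0, 18), [18, 41), [41, inf)
df["AgeGroup"] = pd.cut(age, bins=bins, labels=labels, right=False)

# Percentage of victims per age group, ignoring rows without a usable age
print((df["AgeGroup"].value_counts(normalize=True) * 100).round(1))
```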

FATALITY ( Y or N )

    * We check the unique values, keep only the two values Y and N, and delete all rows that have other values
    * Most victims survive, which matters for pharmaceutical logistics and for transporting injured people to the hospital
    * Assumption: the percentage of survivors is high, and we use it to sell the idea of profiting from selling products to hospitals
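
The Y/N filter and the survival percentage could be checked along these lines. The sample values are invented; only the column name `Fatal Y/N` is taken from the dataset header.

```python
import pandas as pd

# Hypothetical sample of the "Fatal Y/N" column
df = pd.DataFrame({"Fatal Y/N": ["N", "Y", "n", " N", "UNKNOWN", None, "N"]})

# Normalize whitespace and case before filtering
fatal = df["Fatal Y/N"].str.strip().str.upper()

# Keep only rows that are clearly Y or N, storing the normalized value
clean = df[fatal.isin(["Y", "N"])].assign(**{"Fatal Y/N": fatal})

# Share of non-fatal attacks among the rows that remain
survival_rate = (clean["Fatal Y/N"] == "N").mean()
print(f"Survival rate: {survival_rate:.1%}")
```

Normalizing first avoids discarding rows such as `n` or ` N` that differ from a valid value only by case or spacing.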

INJURY TYPE

    * We clean the injury type and separate it by severity
    * We separate the injury type by body part
    * Treatment depends on the type of injury, and so do the supplies
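
A very rough sketch of splitting injuries by severity and body part with keyword matching is shown below. The keyword lists, the severity labels and the sample descriptions are all hypothetical; real rules would need to be derived from the actual `Injury` text and reviewed by hand.

```python
import pandas as pd

# Hypothetical sample of free-text injury descriptions
df = pd.DataFrame({"Injury": [
    "Lacerations to left leg",
    "FATAL",
    "No injury, board bitten",
    "Right arm severed",
    None,
]})

# Hypothetical keyword rules, checked in order; the first match wins
SEVERITY_RULES = [
    ("fatal", ["fatal"]),
    ("none", ["no injury"]),
    ("severe", ["severed", "amputat"]),
    ("moderate", ["lacerat", "bite", "bitten"]),
]
# Hypothetical list of body parts to look for
BODY_PARTS = ["leg", "arm", "foot", "hand", "torso"]


def classify_severity(text):
    """Return the first severity label whose keyword occurs in the text."""
    if not isinstance(text, str):
        return "unknown"
    lowered = text.lower()
    for label, keywords in SEVERITY_RULES:
        if any(keyword in lowered for keyword in keywords):
            return label
    return "unknown"


def find_body_part(text):
    """Return the first listed body part mentioned in the text."""
    if not isinstance(text, str):
        return "unspecified"
    lowered = text.lower()
    return next((part for part in BODY_PARTS if part in lowered), "unspecified")


df["Severity"] = df["Injury"].apply(classify_severity)
df["BodyPart"] = df["Injury"].apply(find_body_part)
print(df)
```

Plain substring matching is crude (a keyword can match inside an unrelated word), so the resulting categories are a starting point for manual review rather than a final classification.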



## CLEAN THE DATA
-------

In [None]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
print(df)


In [None]:
# Preview the first five rows (df was loaded in the first cell)
df.head(5)

In [None]:
# List the column names
df.columns

In [None]:
# Inspect the Sex column
df['Sex']

In [None]:
# Summary of the Sex column (dtype, non-null count); info is a method, so it needs parentheses
df['Sex'].info()

In [None]:
# Number of rows and columns
df.shape

In [10]:
# unique is a method, so it needs parentheses to return the distinct values
df['Sex'].unique()

# NaN , F, M

array(['M', 'F', nan], dtype=object)

In [None]:
df['Sex'].describe()

count     480
unique      2
top         M
freq      415
Name: Sex, dtype: object

In [4]:
# Share of male victims among all 561 rows (the last index is 560, but the row count is 561)
print(round(415 / 561, 4))

0.7398


In [None]:
df['Sex'].value_counts()
# 415 M + 65 F = 480 rows, but the dataset has 561 rows.
# The remaining 81 rows are NaN (value_counts skips missing values by default), so this column needs cleaning.

Sex
M    415
F     65
Name: count, dtype: int64
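
To see the missing rows directly and to get the male-to-female percentages, `value_counts` can be called with `dropna=False` and `normalize=True`. The small Series below is a made-up stand-in for `df['Sex']`, so the printed numbers are not the project's results.

```python
import pandas as pd

# Hypothetical stand-in for df['Sex']
sex = pd.Series(["M", "M", "F", None, "M"], name="Sex")

# Counts including missing values, so the rows add up to len(sex)
print(sex.value_counts(dropna=False))

# Share of M and F among rows with a known value
share = sex.value_counts(normalize=True)
print((share * 100).round(1))

# Number of missing entries
print(sex.isna().sum())
```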

In [13]:
# Drop the rows where Sex is missing, keeping all other columns
df_new = df.dropna(subset=['Sex'])

In [14]:
df_new['Sex'].unique()

array(['M', 'F'], dtype=object)

In [15]:
df_new['Sex'].describe()

count     480
unique      2
top         M
freq      415
Name: Sex, dtype: object

In [9]:
df_new.head()

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
