# PROJECT 02 :
-----

### BRIEF : data analysis of global shark attacks for a business idea
We first examine the Shark Attack dataset to understand its structure and formulate one or more hypotheses about the data.
We hypothesize that shark attacks are more common in certain locations and peak during specific months.
We then define a Business Case, such as:

* “As a company that sells medical products, I want to identify destinations with high shark attack rates.”
* “As a company providing supply transportation services, I want to know when and where shark attacks peak to plan the safe transport of medical supplies to hospitals.”

 Throughout the project, we will use Python and the pandas library to apply at least five data cleaning techniques to handle missing values, duplicates, and formatting inconsistencies. After cleaning, we will perform basic exploratory data analysis to validate our hypotheses and extract insights. 

📝 BUSINESS IDEA: 3 Bullet Points

* Problem to Solve: Coastal hospitals and emergency response teams are not always prepared with the right medical supplies during periods of high shark-attack frequency. This business solves the problem by predicting when and where attacks are most likely, so medical supplies can be stocked in advance.

* Business Concept: Use historical shark attack data to create global heatmaps and seasonal risk forecasts. Then provide pharmaceutical products (painkillers, antibiotics, blood bags, emergency kits) and transportation support to hospitals and ambulances near high-risk beaches.

* Data Used to Profit: The business will analyze Country, Date/Month, Gender, Age, Fatality, and Type of Injury to identify high-risk locations, peak attack months, and most common injury types. This allows optimized supply production, targeted sales, and efficient delivery to the areas that need it most.



## 🌀 COLUMNS TO CLEAN :
-----

**- COUNTRY (global comparison between countries to decide where to invest more)** @Blanca

    * We check the COUNTRY column and make sure every country name is correct
    * We remove rows where COUNTRY is NULL
    * We make sure every COUNTRY name is capitalized and written the same way, for example :
        United States of America == USA == US
        all variants must end up with one consistent name!
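
The COUNTRY steps above could be sketched roughly as follows. This is a minimal sketch, not the project's final code: the alias table `COUNTRY_ALIASES` and the helper `clean_country` are hypothetical names, and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical alias table: extend it after inspecting df["Country"].unique()
COUNTRY_ALIASES = {
    "UNITED STATES OF AMERICA": "USA",
    "US": "USA",
}

def clean_country(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a country and normalise the spelling of the rest."""
    out = df.dropna(subset=["Country"]).copy()
    # Trim whitespace and upper-case so 'usa ' and 'USA' compare equal
    out["Country"] = out["Country"].str.strip().str.upper()
    # Collapse known aliases onto one canonical name (whole-value match only)
    out["Country"] = out["Country"].replace(COUNTRY_ALIASES)
    return out

# Made-up sample rows, only to show the effect
sample = pd.DataFrame(
    {"Country": ["usa ", "United States of America", "US", None, "Australia"]}
)
cleaned = clean_country(sample)
print(cleaned["Country"].value_counts())
```

Because `replace` with a dictionary matches whole values, "AUSTRALIA" is left alone even though it contains the letters "US".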

**- GENDER ( F or M )** @Cecilia

    * We check the unique values and keep only two values, F or M, and delete all rows that have other values
    * We noticed that mostly M get attacked
    * We calculate the percentages of male to female victims

**- DATE ( MONTH + YEAR )** @Blanca

    * We remove rows where DATE is NULL or where the date does not include a month and a year
    * We check the DATE column and separate it into three columns DAY + MONTH + YEAR
    * We verify that the new YEAR column matches the old YEAR column
        * NEW YEAR COLUMN is the one split from the DATE column
        * OLD YEAR COLUMN is the one already existing in the original sheet
    * Where the new YEAR and the old YEAR do not match, we remove the row so we only keep data with accurate years
    * We remove the DAY and OLD YEAR columns
    * We keep the MONTH and matching YEAR, and use the MONTH column to interpret in which months shark attacks happen the most
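
One possible way to implement the DATE steps above is to pull the month and year out of the free-text date with a regular expression. This is a sketch under assumptions: the column names `Month` and `NewYear` and the helper `clean_date` are hypothetical, the DAY column is never created because it would be removed anyway, and the sample rows are adapted from the data preview with one Year deliberately altered to show the mismatch filter.

```python
import re
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# A month name followed by a four-digit year, separated by spaces or hyphens
PATTERN = r"\b(?P<Month>" + "|".join(MONTHS) + r")[a-z]*[\s-]+(?P<NewYear>\d{4})\b"

def clean_date(df: pd.DataFrame) -> pd.DataFrame:
    """Extract month and year from Date and keep rows whose years agree."""
    parts = df["Date"].astype(str).str.extract(PATTERN, flags=re.IGNORECASE)
    out = df.join(parts)
    # Drop rows whose date has no recognisable month and year
    out = out.dropna(subset=["Month", "NewYear"]).copy()
    out["Month"] = out["Month"].str.capitalize()
    out["NewYear"] = pd.to_numeric(out["NewYear"])
    # Keep only rows where the extracted year matches the original Year column
    out = out[out["NewYear"] == out["Year"]]
    # Drop the old Year column and keep the verified one under the same name
    return out.drop(columns=["Year"]).rename(columns={"NewYear": "Year"})

# Illustrative rows; the 2022.0 is an invented mismatch, not real data
sample = pd.DataFrame({
    "Date": ["25 Aug 2023", "21 Aug-2023", "07-Jun-2023", "1733", "Before 1824", "02-Mar-2023"],
    "Year": [2023.0, 2023.0, 2022.0, 1733.0, 0.0, 2023.0],
})
cleaned = clean_date(sample)
print(cleaned)
# Attacks per month, the figure the hypothesis is about
print(cleaned["Month"].value_counts())
```

Entries such as "1733" or "Before 1824" contain no month, so they are dropped in the first filter rather than guessed at.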

**- AGE (victims age ranges)** 

    * majority of victims survive
    * we split the age groups into three categories (minors under 18 / adults 18-40 / elders 40+)
    * note that there are complications depending on age when getting treated
    * percentages of victims based on age ranges
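
The age grouping above could be done with `pd.cut`. This is a sketch with assumptions: the notes leave open which group a 40-year-old belongs to, so here 40 is counted as 40+; the upper bound of 120 and the helper name `add_age_group` are arbitrary choices; the sample ages are taken from the first rows of the data preview.

```python
import pandas as pd

def add_age_group(df: pd.DataFrame) -> pd.DataFrame:
    """Add a numeric age and a three-way age group to a copy of df."""
    out = df.copy()
    # Values such as '20/30' cannot be read as a single age and become NaN
    out["AgeNum"] = pd.to_numeric(out["Age"], errors="coerce")
    # right=False makes the bins [0, 18), [18, 40), [40, 120)
    out["AgeGroup"] = pd.cut(
        out["AgeNum"],
        bins=[0, 18, 40, 120],
        labels=["minor (<18)", "adult (18-39)", "elder (40+)"],
        right=False,
    )
    return out

sample = pd.DataFrame({"Age": ["44", "20/30", "73", "6", "32", None]})
grouped = add_age_group(sample)
# Share of victims per age group in percent; rows without a usable age are excluded
shares = grouped["AgeGroup"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```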

**- FATALITY ( Y or N )** @Cecilia

    * We check the unique values and keep only two values, Y or N, and delete all rows that have other values
    * Most victims survive, which matters for the pharmaceutical logistics and the transportation of injured people to the hospital
    * Assumption: we have a high percentage of survivals, and we use it to sell the idea of profiting from selling products to hospitals
**- INJURY TYPE**

    * clean the type of injury by severity
    * separate the injury type into different severity levels
    * separate the injury type into different body parts
    * treatment depends on the type of injury, and so do the supplies
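
One simple way to split the free-text Injury column by severity and body part is keyword matching. The keyword lists, category names, and helper names below are all hypothetical and would need refining against the real texts (a plain substring test will, for example, also find "arm" inside "harm"). The sample texts come from the data preview.

```python
import pandas as pd

# Hypothetical keyword lists; refine them after reading the real Injury texts
SEVERITY_KEYWORDS = {
    "fatal": ["fatal", "death", "body found"],
    "severe": ["severe", "amputat"],
}
BODY_PART_KEYWORDS = {
    "leg/foot": ["leg", "calf", "foot", "thigh", "lower limb"],
    "arm/hand": ["arm", "hand", "finger"],
}

def first_match(text, keyword_map, default):
    """Return the first category whose keyword occurs in text, else default."""
    if not isinstance(text, str):
        return default
    lowered = text.lower()
    for category, keywords in keyword_map.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return default

def classify_injury(df: pd.DataFrame) -> pd.DataFrame:
    """Add Severity and BodyPart columns derived from the Injury text."""
    out = df.copy()
    out["Severity"] = out["Injury"].apply(
        first_match, args=(SEVERITY_KEYWORDS, "minor/unspecified"))
    out["BodyPart"] = out["Injury"].apply(
        first_match, args=(BODY_PART_KEYWORDS, "other/unspecified"))
    return out

sample = pd.DataFrame({"Injury": [
    "Severe injuries to lower limbs",
    "Calf severely bitten",
    "Left foot bitten",
    "Death by misadventure",
    None,
]})
classified = classify_injury(sample)
print(classified[["Injury", "Severity", "BodyPart"]])
```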



## CLEAN THE DATA
-------

In [8]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
print(df)


                         Date    Year          Type     Country  \
0                 25 Aug 2023  2023.0    Unprovoked   AUSTRALIA   
1                 21 Aug-2023  2023.0  Questionable     BAHAMAS   
2                 07-Jun-2023  2023.0    Unprovoked     BAHAMAS   
3                 02-Mar-2023  2023.0    Unprovoked  SEYCHELLES   
4                 18-Feb-2023  2023.0  Questionable   ARGENTINA   
..                        ...     ...           ...         ...   
556                      1733  1733.0       Invalid     ICELAND   
557                      1723  1723.0    Unprovoked      ROATAN   
558  Late 1600s Reported 1728  1642.0       Invalid      GUINEA   
559               Before 1824     0.0    Unprovoked   AUSTRALIA   
560      No date, Before 1963     0.0       Invalid         USA   

                      State                           Location  \
0           New South Wales  Lighthouse Beach, Port Macquarie    
1    New Providence   Isoad             Saunders Beach, Nassau 

In [9]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
df.head(5)







Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
0,25 Aug 2023,2023.0,Unprovoked,AUSTRALIA,New South Wales,"Lighthouse Beach, Port Macquarie",Surfing,Toby Begg,M,44,Severe injuries to lower limbs,,10h00,"White shark, 3.8-4.2m","B. Myatt, & M. Michaelson, GSAF"
1,21 Aug-2023,2023.0,Questionable,BAHAMAS,New Providence Isoad,"Saunders Beach, Nassau",,male,M,20/30,Body found with shark bites. Possible drowning...,,Morning,,"The Tribune, 8/21/2023"
2,07-Jun-2023,2023.0,Unprovoked,BAHAMAS,Freeport,Shark Junction,Scuba diving,Heidi Ernst,F,73,Calf severely bitten,,13h00,Caribbean rreef shark,"J. Marchand, GSAF"
3,02-Mar-2023,2023.0,Unprovoked,SEYCHELLES,Praslin Island,,Snorkeling,Arthur …,M,6,Left foot bitten,,Afternoon,Lemon shark,"Midlibre, 3/18/2023"
4,18-Feb-2023,2023.0,Questionable,ARGENTINA,Patagonia,Chubut Province,,Diego Barría,M,32,Death by misadventure,,,,"El Pais, 2/27/2023"


In [10]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
df.columns

Index(['Date', 'Year', 'Type', 'Country', 'State', 'Location', 'Activity',
       'Name', 'Sex', 'Age', 'Injury', 'Fatal Y/N', 'Time', 'Species ',
       'Source'],
      dtype='object')
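
The column list above shows `'Species '` with a trailing space, which makes `df['Species']` raise a `KeyError`. A small sketch of how the headers could be normalised before cleaning; the stand-in frame below only mimics a few of the real headers.

```python
import pandas as pd

# Small stand-in frame; the trailing space in 'Species ' mirrors the real header
df = pd.DataFrame(columns=["Date", "Fatal Y/N", "Species ", "Source"])
# Remove leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()
print(df.columns.tolist())
```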

In [11]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
df['Sex']


0        M
1        M
2        F
3        M
4        M
      ... 
556    NaN
557      M
558      M
559      F
560      M
Name: Sex, Length: 561, dtype: object

In [12]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
# info is a method and has to be called; without the parentheses pandas only echoes the bound method
df['Sex'].info()

In [2]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
# unique is a method and has to be called; without the parentheses pandas only echoes the bound method
df['Sex'].unique()

In [14]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
df.shape

(561, 15)

In [3]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")

df['Sex'].value_counts(dropna=False)

# NaN , F, M

Sex
M      415
NaN     81
F       65
Name: count, dtype: int64

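In [None]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
# Normalise whitespace and case first, so that 'male', 'Male ' and 'MALE' all map to the same label
df['Sex'] = df['Sex'].str.strip().str.upper()

# Then map the spelled-out labels to the two accepted values
df['Sex'] = df['Sex'].replace({
    'MALE' : 'M', 
    'FEMALE' : 'F'})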

In [21]:
# Keep working on the df normalised in the previous cell; re-reading the CSV here would discard that step
df = df.dropna(subset=['Sex'])
df['Sex'].value_counts(dropna=False)

Sex
M    415
F     65
Name: count, dtype: int64
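
The male-to-female percentages follow directly from these counts. A small sketch, with the two numbers copied from the `value_counts()` output above; on the real DataFrame, `df['Sex'].value_counts(normalize=True)` gives the same shares as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"M": 415, "F": 65}, name="count")
# Share of each sex among the victims with a known sex, in percent
percentages = (counts / counts.sum() * 100).round(1)
print(percentages)
```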

In [25]:
import pandas as pd
df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
df['Fatal Y/N']

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
556   NaN
557   NaN
558   NaN
559   NaN
560   NaN
Name: Fatal Y/N, Length: 561, dtype: float64

In [26]:
import pandas as pd
df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
# info is a method and has to be called; without the parentheses pandas only echoes the bound method
df['Fatal Y/N'].info()

In [28]:
import pandas as pd
df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
# unique is a method and has to be called; without the parentheses pandas only echoes the bound method
df['Fatal Y/N'].unique()

In [29]:
import pandas as pd
df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
df['Fatal Y/N'].value_counts(dropna=False)

Fatal Y/N
NaN    561
Name: count, dtype: int64

In [30]:
import pandas as pd

df = pd.read_csv(r"C:\Users\cecil\OneDrive\Dokumente\GitHub\02_project-shark-attacks\data\raw.csv")
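# 'Fatal Y/N' is NaN in all 561 rows of this file (see the counts above), so dropping
# rows without a value removes every row and leaves nothing to analyse
df.dropna(subset=['Fatal Y/N'], inplace=True)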
df['Fatal Y/N'].value_counts(dropna=False)

Series([], Name: count, dtype: int64)
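
Since the Fatal Y/N counts above show only NaN, a `dropna` on that column empties the whole dataset. A possible guard, sketched here with a hypothetical helper `empty_columns` and a made-up three-row frame, is to list the columns that contain no values at all before dropping rows on them.

```python
import pandas as pd

def empty_columns(df: pd.DataFrame) -> list:
    """Names of columns that contain no values at all."""
    return [name for name in df.columns if df[name].isna().all()]

# Made-up frame that mimics the situation: Sex is partly filled, Fatal Y/N is empty
sample = pd.DataFrame({
    "Sex": ["M", "F", None],
    "Fatal Y/N": [None, None, None],
})
unusable = empty_columns(sample)
print(unusable)

# Only drop rows on columns that actually carry data
usable_subset = [c for c in ["Sex", "Fatal Y/N"] if c not in unusable]
kept = sample.dropna(subset=usable_subset)
print(len(kept))
```

If a column that the analysis depends on turns up in this list, the raw export itself is the thing to check, since no amount of row filtering can recover the missing values.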