# SHARK ATTACK PROJECT

The company Surf Legal aims to open a new asset, however does not have know-how to decide where to build its first board producing facility. In order to find the best geographical location, Surf Legal wants to develop a market research based on a csv file, which contains the historical shark attacks data.

The market research must answer the following questions:
    
    - Which geographical location is the best for the company?
    - Which client profile should the company prioritize?

**Assumptions**

    - Only surfing and standing activities should be considered; 
    - The period considered for the market research is the last 20 years;
    - The location with more entries is the location with more board use.

**Conclusion**



**Importando os pacotes que serão utilizados no projeto**

In [1]:
# importar os pacotes que serão utilizados para desenvolver o projeto
import pandas as pd
import numpy as np

**Importando o banco de dados que será utilizado**

In [2]:
# usar o pandas para ler a base de dados 'attacks.csv' e atribuir ele a variavel df (dataframe)
df = pd.read_csv('attacks.csv', encoding='latin8')

# usar o comando head para visualizar uma prévia do dataframe
df.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


In [3]:
# checar as informações gerais do df e verificar quais colunas possuem valores missings
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null 

**Criação de um novo DataFrame com as Colunas de Interesse**

In [4]:
#Criando um novo dataframe somente com as colunas que serão utilizadas na analise
shark_df = df[['Year','Country','Area','Location', 'Activity','Sex ','Age']].copy()
shark_df

Unnamed: 0,Year,Country,Area,Location,Activity,Sex,Age
0,2018.0,USA,California,"Oceanside, San Diego County",Paddling,F,57
1,2018.0,USA,Georgia,"St. Simon Island, Glynn County",Standing,F,11
2,2018.0,USA,Hawaii,"Habush, Oahu",Surfing,M,48
3,2018.0,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,M,
4,2018.0,MEXICO,Colima,La Ticla,Free diving,M,
...,...,...,...,...,...,...,...
25718,,,,,,,
25719,,,,,,,
25720,,,,,,,
25721,,,,,,,


In [5]:
#Corrigindo o nome da coluna
shark_df.rename(columns={'Sex ':'Sex'}, inplace=True)

In [6]:
#Analisando os valores distintos da coluna Activity para descobrir quais esportes usam prancha
shark_df.Activity.unique()

array(['Paddling', 'Standing', 'Surfing', ...,
       'Crew swimming alongside their anchored ship',
       '4 men were bathing', 'Wreck of  large double sailing canoe'],
      dtype=object)

**Limpando a Coluna Year**

In [7]:
# Mudando a coluna Year para númerica
shark_df['Year'] = pd.to_numeric(shark_df['Year'], errors = 'coerce').copy()

#Removendo os valores missing da coluna Year
shark_df = shark_df.dropna(subset = ['Year']).copy()

#Conferindo se os valores missings foram excluidos
shark_df.Year.unique()

array([2018., 2017., 2016., 2015., 2014., 2013., 2012., 2011., 2010.,
       2009., 2008., 2007., 2006., 2005., 2004., 2003., 2002., 2001.,
       2000., 1999., 1998., 1997., 1996., 1995., 1984., 1994., 1993.,
       1992., 1991., 1990., 1989., 1969., 1988., 1987., 1986., 1985.,
       1983., 1982., 1981., 1980., 1979., 1978., 1977., 1976., 1975.,
       1974., 1973., 1972., 1971., 1970., 1968., 1967., 1966., 1965.,
       1964., 1963., 1962., 1961., 1960., 1959., 1958., 1957., 1956.,
       1955., 1954., 1953., 1952., 1951., 1950., 1949., 1948., 1848.,
       1947., 1946., 1945., 1944., 1943., 1942., 1941., 1940., 1939.,
       1938., 1937., 1936., 1935., 1934., 1933., 1932., 1931., 1930.,
       1929., 1928., 1927., 1926., 1925., 1924., 1923., 1922., 1921.,
       1920., 1919., 1918., 1917., 1916., 1915., 1914., 1913., 1912.,
       1911., 1910., 1909., 1908., 1907., 1906., 1905., 1904., 1903.,
       1902., 1901., 1900., 1899., 1898., 1897., 1896., 1895., 1894.,
       1893., 1892.,

In [8]:
# Convertendo os valores da coluna Year para Int
shark_df = shark_df.astype({'Year': int}).copy()
shark_df.Year.value_counts()

2015    143
2017    136
2016    130
2011    128
2014    127
       ... 
1742      1
1738      1
1638      1
5         1
1543      1
Name: Year, Length: 249, dtype: int64

**Aplicando as condições de Interesse no DataFrama**

In [9]:
# criando uma mascara para selecionar apenas os esportes com prancha
conditions = (shark_df['Activity'] == 'Standing') | (shark_df['Activity'] == 'Surfing')

# Aplicando a mascara no dataframe
shark_df = shark_df[conditions].copy()
shark_df

Unnamed: 0,Year,Country,Area,Location,Activity,Sex,Age
1,2018,USA,Georgia,"St. Simon Island, Glynn County",Standing,F,11
2,2018,USA,Hawaii,"Habush, Oahu",Surfing,M,48
3,2018,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,M,
9,2018,USA,Florida,"Daytona Beach, Volusia County",Standing,M,12
15,2018,SOUTH AFRICA,Eastern Cape Province,"Nahoon Beach, East London",Surfing,M,
...,...,...,...,...,...,...,...
6097,1828,USA,Hawaii,"Uo, Lahaina, Maui",Surfing,M,
6143,1779,USA,Hawaii,"Maliu, Hawai'i",Surfing,M,young
6201,0,USA,Florida,"Lost Tree Village, Palm Beach County",Surfing,M,
6249,0,USA,Florida,"Palm Beach, Palm Beach County",Standing,M,


In [10]:
# criando uma mascara para selecionar apenas os anos de interrese
year_cut = (shark_df['Year'] >= 2000)

# Aplicando a mascara no dataframe
shark_df = shark_df[year_cut].copy()
shark_df

Unnamed: 0,Year,Country,Area,Location,Activity,Sex,Age
1,2018,USA,Georgia,"St. Simon Island, Glynn County",Standing,F,11
2,2018,USA,Hawaii,"Habush, Oahu",Surfing,M,48
3,2018,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,M,
9,2018,USA,Florida,"Daytona Beach, Volusia County",Standing,M,12
15,2018,SOUTH AFRICA,Eastern Cape Province,"Nahoon Beach, East London",Surfing,M,
...,...,...,...,...,...,...,...
2056,2000,PAPUA NEW GUINEA,Madang Province,"Long Island near Madang, about 500 km (310 mil...",Standing,M,9
2063,2000,USA,Florida,"Floridana Beach, Brevard County",Surfing,M,37
2065,2000,AUSTRALIA,New South Wales,"McMasters Beach, Central Coast",Surfing,M,
2074,2000,NEW ZEALAND,South Island,Oreti Beach (reported as the 4th person bitten...,Surfing,M,12


In [11]:
#verificando quais os valores da coluna Sex
shark_df.Sex.value_counts()

M      591
F       64
lli      1
Name: Sex, dtype: int64

In [12]:
# criando uma mascara para selecionar apenas os sexos Masculinos e Feminino
sex_cut = (shark_df['Sex'] == 'M' ) | (shark_df['Sex'] == 'F')

# Aplicando a mascara no dataframe
shark_df = shark_df[sex_cut].copy()
shark_df

Unnamed: 0,Year,Country,Area,Location,Activity,Sex,Age
1,2018,USA,Georgia,"St. Simon Island, Glynn County",Standing,F,11
2,2018,USA,Hawaii,"Habush, Oahu",Surfing,M,48
3,2018,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,M,
9,2018,USA,Florida,"Daytona Beach, Volusia County",Standing,M,12
15,2018,SOUTH AFRICA,Eastern Cape Province,"Nahoon Beach, East London",Surfing,M,
...,...,...,...,...,...,...,...
2056,2000,PAPUA NEW GUINEA,Madang Province,"Long Island near Madang, about 500 km (310 mil...",Standing,M,9
2063,2000,USA,Florida,"Floridana Beach, Brevard County",Surfing,M,37
2065,2000,AUSTRALIA,New South Wales,"McMasters Beach, Central Coast",Surfing,M,
2074,2000,NEW ZEALAND,South Island,Oreti Beach (reported as the 4th person bitten...,Surfing,M,12


In [13]:
# verificando se os filtros foram aplicados corretamente no dataframe
print(shark_df.Activity.unique())
print(' ')
print(shark_df.Year.unique())
print(' ')
print(shark_df.Sex.unique())

['Standing' 'Surfing']
 
[2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005
 2004 2003 2002 2001 2000]
 
['F' 'M']


**Verificando as informações do novo dataframe**

In [14]:
# verificando em quais colunas temos valores missings
shark_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 655 entries, 1 to 2075
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Year      655 non-null    int32 
 1   Country   654 non-null    object
 2   Area      651 non-null    object
 3   Location  650 non-null    object
 4   Activity  655 non-null    object
 5   Sex       655 non-null    object
 6   Age       539 non-null    object
dtypes: int32(1), object(6)
memory usage: 38.4+ KB


In [15]:
# checando o tamanho do novo dataframe
shark_df.shape

(655, 7)

**Limpando a coluna Age**

In [16]:
# iniciando o tratamento pela coluna com mais valores missings
# verificando os valores unicos da coluna Age, e identificando quais valores não estão corretos
shark_df.Age.unique()

array(['11', '48', nan, '12', '60', '41', '37', '19', '18', '20', '54',
       '35', '14', '24', '25', '31', '33', '28', '42', '17', '13', '58',
       '16', '65', '36', '29', '21', '43', '22', '9', '27', '15', '32',
       '38', '52', '47', '40', '45', '34', '23', '26', '50', '44', '50s',
       '51', '8', 'teen', '30', '7', '10', '39', '30s', '61', '53', '63',
       '49', '46', 'Teen', '55', '68', '59', '57', '30 or 36', '33 or 37'],
      dtype=object)

In [17]:
# corrigindo os casos com erro que era possível identificar o valor real
shark_df.loc['30s', 'Age'] = '30'
shark_df.loc['50s', 'Age'] = '50'

In [18]:
# convertendo os valores da coluna Age para Númericos
# e transformando os erros que não são possíveis identificar a idade real para valores nulos.
shark_df['Age'] = pd.to_numeric(shark_df['Age'], errors = 'coerce')

In [19]:
# calculando a mediana das idades e arrendondando para um número inteiro
age_median= round(shark_df[shark_df.Age.notnull()].Age.astype(int).median())
age_median

25

In [20]:
# atribuindo para os valores missings a mediana das idades
shark_df.Age.fillna(age_median, inplace=True)

In [21]:
#checando as alterações no dataframe
shark_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 657 entries, 1 to 50s
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      655 non-null    float64
 1   Country   654 non-null    object 
 2   Area      651 non-null    object 
 3   Location  650 non-null    object 
 4   Activity  655 non-null    object 
 5   Sex       655 non-null    object 
 6   Age       657 non-null    float64
dtypes: float64(2), object(5)
memory usage: 61.1+ KB


**Limpando a coluna Location**

In [22]:
# considerando os casos geograficos (Country, Area, Location)
# vamos iniciar eliminando os valores nulos na menor granulometria do drill down

shark_df = shark_df.dropna(subset=['Location']).copy()
shark_df.Location.value_counts()

New Smyrna Beach, Volusia County                 87
Ponce Inlet, Volusia County                       9
Ponce Inlet, New Smyrna Beach, Volusia County     8
Melbourne Beach, Brevard County                   6
Daytona Beach, Volusia County                     5
                                                 ..
Guarajuba                                         1
Clifton Beach                                     1
Bittern                                           1
Stuart Rocks, Martin County                       1
Kawa'a                                            1
Name: Location, Length: 472, dtype: int64

**Limpando a coluna Area**

In [23]:
# criando uma mascara para verificar em quais linhas estão os valores nulos da coluna Area
mask1 = shark_df.Area.isnull()
mask2 = mask1[mask1 == True]
mask2

229     True
388     True
1603    True
Name: Area, dtype: bool

In [24]:
# Verificando qual a localização correspondente dos valores missings da Area
print(shark_df.Location[229])
print(shark_df.Location[388])
print(shark_df.Location[1603])

Boucan Canot
St. Leu
Punta Caracas


In [25]:
# Atribuindo a coluna Area os valores correspondentes da coluna Location
shark_df.loc[229, 'Area'] = 'Saint Gilles'
shark_df.loc[388, 'Area'] = 'Saint Gilles'
shark_df.loc[1603, 'Area'] = 'Punta Caracas'

In [26]:
# verificando se todos os valores nulos foram tratados
shark_df.Area.isnull().sum()

0

In [27]:
# Verificando os valores distintos da coluna Area e checando se existe algum erro
shark_df.Area.unique()

array(['Georgia', 'Hawaii', 'New South Wales', 'Florida',
       'Eastern Cape Province', 'Western Australia', 'New Providence',
       'Victoria', 'Fernando de Noronha', 'California', 'New York',
       'Shizuoka Prefecture', 'Ascension Island', 'Washington',
       'Marquesas', 'Western Cape Province', 'South Carolina', 'Bali',
       'Queensland', 'Oregon', 'Saint Gilles', 'Kochi Prefecture',
       'Tasmania', 'North Carolina', 'Guanacaste', 'Le Port',
       'South Australia', 'Saint-Gilles-les-Bains', 'Pernambuco',
       'Delaware', 'Atsumi peninsula', 'South Island',
       'Santa Cruz Island', 'British Colombia', 'North Island',
       'Santa Elena', 'Moray', 'Puerto Rico', 'Society Islands',
       'Saint-Gilles', 'Saint Gilles ', 'Vitu Levu', 'Abaco Islands',
       'Virginia', 'Texas', 'Saint-Benoit', 'Rio Grande Do Sul', 'Dubai',
       'Western Province', 'Eastern Province', 'KwaZulu-Natal',
       'South Province', 'Galapagos Islands', 'Bahia', 'Guerro',
       'Baja Cal

**Limpando a coluna Country**

In [28]:
# verificando se a coluna Country possuí valores nulos
shark_df.Country.isnull().sum()

0

In [29]:
#verificando os valores distintos da coluna e verificando algum erro aparente
shark_df.Country.unique()

array(['USA', 'AUSTRALIA', 'SOUTH AFRICA', 'BAHAMAS', 'BRAZIL', 'JAPAN',
       'ST HELENA, British overseas territory', 'FRENCH POLYNESIA',
       'INDONESIA', 'REUNION', 'COSTA RICA', 'NEW ZEALAND', 'ECUADOR',
       'CANADA', 'SCOTLAND', 'FIJI', 'UNITED ARAB EMIRATES (UAE)',
       'NEW CALEDONIA', 'MEXICO', 'ST. MAARTIN', 'MEXICO ', 'VENEZUELA',
       'URUGUAY', 'OKINAWA', 'PAPUA NEW GUINEA'], dtype=object)

In [41]:
# Mexico aparece duas vezes na lista, e Japão e Okinawa estão separados
# Para corrigir o erro vamos substituir os valores com o metodo replace

shark_df['Country'].replace(['OKINAWA'], 'JAPAN', inplace = True)

shark_df['Country'].replace(['MEXICO '], 'MEXICO', inplace = True)

In [42]:
# Verificando se os valores foram corrigidos
shark_df.Country.unique()

array(['USA', 'AUSTRALIA', 'SOUTH AFRICA', 'BAHAMAS', 'BRAZIL', 'JAPAN',
       'ST HELENA, British overseas territory', 'FRENCH POLYNESIA',
       'INDONESIA', 'REUNION', 'COSTA RICA', 'NEW ZEALAND', 'ECUADOR',
       'CANADA', 'SCOTLAND', 'FIJI', 'UNITED ARAB EMIRATES (UAE)',
       'NEW CALEDONIA', 'MEXICO', 'ST. MAARTIN', 'VENEZUELA', 'URUGUAY',
       'PAPUA NEW GUINEA'], dtype=object)

**Conferindo as informações gerais do Dataframe limpo**

In [46]:
# verificando se ainda temos valores nulos no Dataframe
shark_df.isnull().sum()

Year          0
Country       0
Area          0
Location      0
Activity      0
Sex           0
Age           0
MEXICO      649
OKINAWA     649
dtype: int64

In [50]:
# foi observado a criação das colunas 'MEXICO ' e 'OKINAWA'
# por isso vamos verificar as informações do dataframe

shark_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 650 entries, 1 to 2075
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      650 non-null    float64
 1   Country   650 non-null    object 
 2   Area      650 non-null    object 
 3   Location  650 non-null    object 
 4   Activity  650 non-null    object 
 5   Sex       650 non-null    object 
 6   Age       650 non-null    float64
 7   MEXICO    1 non-null      object 
 8   OKINAWA   1 non-null      object 
dtypes: float64(2), object(7)
memory usage: 70.8+ KB


In [53]:
# não tivemos alterações no número de linhas do dataframe
# vamos eliminar as colunas 'MEXICO ' e 'OKINAWA'

shark_df.drop(columns=['MEXICO ', 'OKINAWA'], inplace=True)

In [72]:
# verificando se a correção foi aplicada
shark_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 650 entries, 1 to 2075
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      650 non-null    float64
 1   Country   650 non-null    object 
 2   Area      650 non-null    object 
 3   Location  650 non-null    object 
 4   Activity  650 non-null    object 
 5   Sex       650 non-null    object 
 6   Age       650 non-null    float64
dtypes: float64(2), object(5)
memory usage: 60.6+ KB


In [57]:
# visualizando o dataframe

shark_df

Unnamed: 0,Year,Country,Area,Location,Activity,Sex,Age
1,2018.0,USA,Georgia,"St. Simon Island, Glynn County",Standing,F,11.0
2,2018.0,USA,Hawaii,"Habush, Oahu",Surfing,M,48.0
3,2018.0,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,M,25.0
9,2018.0,USA,Florida,"Daytona Beach, Volusia County",Standing,M,12.0
15,2018.0,SOUTH AFRICA,Eastern Cape Province,"Nahoon Beach, East London",Surfing,M,25.0
...,...,...,...,...,...,...,...
2056,2000.0,PAPUA NEW GUINEA,Madang Province,"Long Island near Madang, about 500 km (310 mil...",Standing,M,9.0
2063,2000.0,USA,Florida,"Floridana Beach, Brevard County",Surfing,M,37.0
2065,2000.0,AUSTRALIA,New South Wales,"McMasters Beach, Central Coast",Surfing,M,25.0
2074,2000.0,NEW ZEALAND,South Island,Oreti Beach (reported as the 4th person bitten...,Surfing,M,12.0


Respondendo a primeira pergunta:

**Dentre as cidades analisadas no Banco de Dados, qual a melhor localização para abrir uma fábrica de pranchas?**


In [60]:
# Vamos utilizar a coluna Location para identificar as 10 regiões com maior número de casos
shark_df.Location.value_counts().head(10)

New Smyrna Beach, Volusia County                                 87
Ponce Inlet, Volusia County                                       9
Ponce Inlet, New Smyrna Beach, Volusia County                     8
Melbourne Beach, Brevard County                                   6
Daytona Beach, Volusia County                                     5
Nahoon, East London                                               5
San Onofre State Beach, San Diego County                          4
Ormond Beach, Volusia County                                      4
Balian                                                            4
Playalinda Beach, Canaveral National Seashore, Brevard County     4
Name: Location, dtype: int64

In [95]:
# Verificamos que a região de Volusia County aparece com frequencia no top 10
# por isso vamos analisar quantos incidentes ocorreram naquela região

var_list = list(shark_df.Location.values)

# Vamos utilizar REGEX para contabilizar quantas casos ocorreram nessa região
import re


n_list=list(map(lambda x: re.findall('[V][o][l][u][s][i][a][ ][C][o][u][n][t][y]', x), var_list))

volusia_count = []

for i in n_list:
    if i == ['Volusia County']:
       volusia_count.append(i)
    
print(len(volusia_count))

132


In [97]:
# observamos no top 10 que a segunda região com mais ataques fica localizada é Melbourne Beach, Brevard County
# por isso vamos verificar se a Australia possuí mais que 132 registros

shark_df.Country.value_counts()

USA                                      387
AUSTRALIA                                146
SOUTH AFRICA                              43
BRAZIL                                    13
REUNION                                   10
NEW ZEALAND                                9
MEXICO                                     6
INDONESIA                                  6
ECUADOR                                    4
JAPAN                                      4
COSTA RICA                                 4
BAHAMAS                                    3
VENEZUELA                                  2
NEW CALEDONIA                              2
FRENCH POLYNESIA                           2
FIJI                                       2
SCOTLAND                                   1
ST HELENA, British overseas territory      1
CANADA                                     1
ST. MAARTIN                                1
URUGUAY                                    1
PAPUA NEW GUINEA                           1
UNITED ARA

In [101]:
# Como a Australia possuí 146 registros, vamos aplicar novamente o procedimento para Brevard County

n2_list=list(map(lambda x: re.findall('[B][r][e][v][a][r][d][ ][C][o][u][n][t][y]', x), var_list))

brevard_count = []

for i in n2_list:
    if i == ['Brevard County']:
       brevard_count.append(i)
    
print(len(brevard_count))

34


In [108]:
#Pra finalizar as analises, vamos verificar a porcentagem correspondente dos casos em Volusia County

porc = (len(volusia_count)/shark_df.Location.count())*100
porc.round(2)

20.31

**R:** Com 132 casos (20,31% dos registros), podemos afirmar que a região de Volusia County, localizada na Florida, é o local ideal para abrir uma fábrica de pranchas.

Definido em qual região vamos abrir a fábrica, podemos responder a segunda pergunta:

**Qual o perfil do cliente que vamos direcionar as vendas?**


In [142]:
# Verificando a proporção entre homens e mulheres em todo o dataframe

shark_df.Sex.value_counts(normalize=True).round(2)*100

M    90.0
F    10.0
Name: Sex, dtype: float64

In [143]:
# Verificando a proporção entre homens e mulheres na região da fábrica

persona = shark_df.loc[shark_df['Area']=='Florida', 'Sex']
persona.value_counts(normalize=True).round(2)*100

M    88.0
F    12.0
Name: Sex, dtype: float64

In [124]:
# Verificando a mediana das idades em todo o Dataframe

shark_df.groupby('Sex')['Age'].median()

Sex
F    24.0
M    25.0
Name: Age, dtype: float64

In [145]:
# Verificando a mediana das idades na região da fábrica

shark_df[shark_df['Area']=='Florida'].groupby(['Sex'], as_index=False).agg(age_median = ('Age','median'))

Unnamed: 0,Sex,age_median
0,F,20.0
1,M,23.0


**R:** Baseado nas analises acima, podemos dividir o perfil de clientes que compram as pranchas em dois grupos:

- Mulheres de 20 a 25 anos
- Homens de 20 a 25 anos

Sendo que a persona criada pelos dados é:

  **Homem de 23 anos de idade, morador da Florida**