# Introduction

Congratulations! You are the newest data scientist at BlueToucan Medical, a multinational pharmaceutical company that develops drugs for treating several types of cancer. In a meeting, the company's management asked you to take responsibility for evaluating the data from the latest study, conducted in partnership with the University of Wisconsin hospitals.

# Problem

Breast cancer is the most common type of cancer among women worldwide and in Brazil, after non-melanoma skin cancer. It currently accounts for about 28% of new cancer cases in women. Breast cancer also affects men, but this is rare, representing less than 1% of all cases of the disease. Statistics indicate that its incidence is rising in both developed and developing countries. There are several types of breast cancer: some progress quickly, others do not. Most cases have a good prognosis.

BlueToucan's executives need to make decisions about producing a drug to fight breast cancer, so they have commissioned a data analysis from you. Your mission is to use the collected data to extract as much information as possible about the characteristics of the cancer, the individuals, where they live, and which factors appear to be related to the high number of breast cancer cases.

# The data

## breast_cancer_data.csv

> mean_radius: mean radius of the removed lumps<br>
> mean_texture: mean texture of the removed lumps<br>
> mean_perimeter: mean perimeter of the removed lumps<br>
> mean_area: mean area of the removed lumps<br>
> mean_smoothness: mean smoothness of the removed lumps<br>
> diagnosis: diagnosis (1 - cancerous, 0 - non-cancerous)<br>
> age: patient's age<br>
> name: patient's name<br>
> zipcode: zip code of the patient's city of residence<br>
> diabetes: patient diagnosed with diabetes (0 - no diabetes, 1 - diabetes)<br>
> family_history: patient has a family history of breast cancer<br>

## median_hh_income.csv

> COUNTY: county name<br>
> COUNT: median annual income<br>

## percentage_no_health_insurance.csv

> COUNTY: county name<br>
> COUNT: percentage of the population without health insurance<br>

## toxic_air_arsenic.csv

> COUNTY: county name<br>
> POUNDS: amount (in pounds) of arsenic released into the air<br>

## wi_county_data.csv

> ZIP: zip code of the city<br>
> COUNTY: county name<br>

## wi_regions.txt

> Text document describing the regions and their respective counties.

## WI (geojson)

> Folder containing GeoJSON files describing the shape of every county.
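
The files described above share keys: `zipcode` in the patient table corresponds to `ZIP` in `wi_county_data.csv`, and `COUNTY` links the county-level tables. A minimal sketch of how they might be joined follows. It uses tiny made-up frames instead of the real files, so every value and the names `patients`, `zip_to_county`, `income` and `median_income` are assumptions for illustration; only the column names come from the descriptions above.

```python
import pandas as pd

# Toy stand-ins for the real files; every value here is invented.
patients = pd.DataFrame({
    "name": ["A", "B", "C"],
    "zipcode": [53001, 53002, 53001],
    "diagnosis": [1, 0, 1],
})
zip_to_county = pd.DataFrame({
    "ZIP": [53001, 53002],
    "COUNTY": ["County X", "County Y"],
})
income = pd.DataFrame({
    "COUNTY": ["County X", "County Y"],
    "COUNT": [50000, 60000],
})

# Step 1: attach a county name to each patient through the zip code.
merged = patients.merge(zip_to_county, left_on="zipcode", right_on="ZIP", how="left")

# Step 2: attach county-level income. COUNT is renamed first because
# more than one county file reuses that column name.
merged = merged.merge(
    income.rename(columns={"COUNT": "median_income"}), on="COUNTY", how="left"
)
print(merged[["name", "zipcode", "COUNTY", "median_income"]])
```

A left join keeps every patient even when a zip code has no match, which makes unmatched rows easy to spot as missing values.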

In [37]:
import pandas as pd
import numpy as np

In [2]:
arr = np.arange(36).reshape(6,6)
arr

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

In [3]:
# Column labels are in Portuguese: age, height, foot size, salary, eye color, weight
df = pd.DataFrame(arr,columns=['idade','altura','tamanho do pe','salario','cor_do_olho','peso'],
                  index=['nasser','joao','marcelo','maria','joana','raquel'])

In [4]:
df

Unnamed: 0,idade,altura,tamanho do pe,salario,cor_do_olho,peso
nasser,0,1,2,3,4,5
joao,6,7,8,9,10,11
marcelo,12,13,14,15,16,17
maria,18,19,20,21,22,23
joana,24,25,26,27,28,29
raquel,30,31,32,33,34,35


In [5]:
df.loc['marcelo','salario']

15

In [6]:
arr2 = np.random.uniform(size=6)
arr2

array([0.15609778, 0.86805997, 0.24078928, 0.68124124, 0.84091238,
       0.69446869])

In [7]:
pd.Series(arr2,index=['nasser','joao','marcelo','maria','joana','raquel'],name='salario_normalizado')

nasser     0.156098
joao       0.868060
marcelo    0.240789
maria      0.681241
joana      0.840912
raquel     0.694469
Name: salario_normalizado, dtype: float64

In [8]:
df['tamanho do pe']

nasser      2
joao        8
marcelo    14
maria      20
joana      26
raquel     32
Name: tamanho do pe, dtype: int64

In [89]:
df = pd.read_csv('../data/breast_cancer/breast_cancer_data.csv')
df.tail()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,name,zipcode,diabetes,family_history
564,10.71,20.39,69.5,344.9,0.1082,1,50,3,Shannon James,53006,0,1
565,12.87,16.21,82.38,512.2,0.09425,1,41,1,Marie Christian,53007,0,1
566,13.59,21.84,87.16,561.0,0.07956,1,43,7,Tracy Morgan,53008,1,1
567,11.74,14.02,74.24,427.3,0.07813,1,48,3,Dawn Smith,53015,0,1
568,7.76,24.54,47.92,181.0,0.05263,1,54,6,Christine Nguyen,53013,0,1


In [93]:
mask = df.mean_radius < 11

In [95]:
df[mask].head()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,name,zipcode,diabetes,family_history
158,10.95,21.35,71.9,371.1,0.1227,0,32,0,Shelia Henderson,53011,0,0
214,10.57,20.22,70.15,338.3,0.09073,1,43,4,Aimee Turner,53002,0,1
215,10.8,21.98,68.79,359.9,0.08801,1,43,1,Sophia Johnson,53001,1,0
222,10.48,14.98,67.49,333.6,0.09816,1,52,8,Ashley Wise,53014,0,1
229,10.03,,63.19,307.3,0.08117,1,38,0,Brenda Jones,53016,1,1


In [98]:
df[(df.mean_radius < 11) & (df.diagnosis == 0)]

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,name,zipcode,diabetes,family_history
158,10.95,21.35,71.9,371.1,0.1227,0,32,0,Shelia Henderson,53011,0,0
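
Boolean masks can be combined with `&` (and), `|` (or) and `~` (not). The sketch below shows the difference on a small invented frame; the values and the names `toy`, `small` and `non_cancerous` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Invented values, reusing two column names from the dataset.
toy = pd.DataFrame({
    "mean_radius": [10.5, 12.0, 9.8, 15.1],
    "diagnosis": [0, 0, 1, 1],
})

small = toy.mean_radius < 11
non_cancerous = toy.diagnosis == 0

both = toy[small & non_cancerous]        # rows meeting both conditions
either = toy[small | non_cancerous]      # rows meeting at least one condition
neither = toy[~(small | non_cancerous)]  # rows meeting none

print(len(both), len(either), len(neither))
```

When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `<` and `==`.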


In [112]:
df.shape

(569, 12)

In [111]:
df.zipcode.unique()

array([53013, 53007, 53001, 53006, 53008, 53016, 53011, 53003, 53005,
       53015, 53004, 53010, 53002, 53014, 53012])

In [115]:
"{}{}{}".format(5,22,45)

'52245'

In [117]:
x = 3

f"today I drank {x} liters of water"

'today I drank 3 liters of water'

In [126]:
df.query(" zipcode == 53013 & diagnosis == 1").mean_area.var()

14321.646397058823
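
Instead of formatting values into the query string, `DataFrame.query` can also read a local Python variable through the `@` prefix. A small sketch on invented data (the frame `toy` and the variable `target_zip` are assumptions for illustration):

```python
import pandas as pd

# Invented values, reusing column names from the dataset.
toy = pd.DataFrame({
    "zipcode": [53013, 53013, 53007, 53013],
    "diagnosis": [1, 1, 1, 0],
    "mean_area": [400.0, 600.0, 500.0, 900.0],
})

target_zip = 53013
selected = toy.query("zipcode == @target_zip and diagnosis == 1")

# Series.var() is the sample variance (ddof=1 by default).
area_var = selected.mean_area.var()
print(area_var)
```

Note that pandas divides by `n - 1` here, whereas `np.var` divides by `n` unless `ddof=1` is passed.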

In [108]:
for zipcode in df.zipcode.unique():
    print(zipcode,df.query(f" zipcode == {zipcode} & diagnosis == 1").mean_area.mean())

53013 489.45294117647063
53007 519.7
53001 431.847619047619
53006 443.0642857142857
53008 489.01000000000005
53016 474.27
53011 448.76562499999994
53003 493.86
53005 441.85
53015 426.22222222222223
53004 457.4222222222222
53010 479.4121951219512
53002 453.44444444444446
53014 417.178947368421
53012 477.825
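
The loop above runs one query per zip code. The same per-group mean can usually be computed in a single pass with `groupby`. A minimal sketch on invented data (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Invented values, reusing column names from the dataset.
toy = pd.DataFrame({
    "zipcode": [53001, 53001, 53002, 53002, 53002],
    "diagnosis": [1, 1, 1, 0, 1],
    "mean_area": [400.0, 500.0, 300.0, 900.0, 500.0],
})

# Keep only diagnosis == 1, then average mean_area within each zip code.
mean_area_by_zip = (
    toy[toy.diagnosis == 1]
    .groupby("zipcode")["mean_area"]
    .mean()
)
print(mean_area_by_zip)
```

By default `groupby` sorts the result by key, while the loop follows the order in which `unique()` first encounters each zip code.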


In [130]:
df.head()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,name,zipcode,diabetes,family_history
0,17.99,10.38,122.8,1001.0,0.1184,0,33,1,Abigail Shaffer,53013,0,0
1,14.22,23.12,94.37,609.9,0.1075,0,25,7,Tiffany Miller,53007,0,0
2,12.34,26.86,81.15,477.4,0.1034,0,39,8,Anna Walker,53001,0,1
3,14.86,23.21,100.4,,0.1044,0,30,5,Elizabeth Perkins,53006,0,1
4,13.77,22.29,90.63,588.9,0.12,0,29,1,Erin Warner,53001,0,0


In [131]:
df.tail()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,name,zipcode,diabetes,family_history
564,10.71,20.39,69.5,344.9,0.1082,1,50,3,Shannon James,53006,0,1
565,12.87,16.21,82.38,512.2,0.09425,1,41,1,Marie Christian,53007,0,1
566,13.59,21.84,87.16,561.0,0.07956,1,43,7,Tracy Morgan,53008,1,1
567,11.74,14.02,74.24,427.3,0.07813,1,48,3,Dawn Smith,53015,0,1
568,7.76,24.54,47.92,181.0,0.05263,1,54,6,Christine Nguyen,53013,0,1


In [143]:
df.sample(5, random_state=42)

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,name,zipcode,diabetes,family_history
204,14.42,19.77,94.48,642.5,0.09752,0,27,6,Sharon Cherry,53012,0,1
70,20.16,19.66,131.1,1274.0,0.0802,0,26,5,Kimberly Levine,53008,0,1
131,18.61,20.25,122.1,1094.0,0.0944,0,33,8,Michelle Maldonado,53005,0,1
431,13.49,22.3,86.91,561.0,0.08752,1,42,3,Maureen Lewis,53001,1,0
540,11.89,18.35,77.32,432.2,0.09363,1,46,7,Mary Moore,53012,1,1


In [145]:
df3 = pd.DataFrame(np.arange(25).reshape(5,5))

In [182]:
df3.sample(weights=[0.1,0.1,0.1,0.6,0.1])

Unnamed: 0,0,1,2,3,4
2,10,11,12,13,14
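
The `weights` argument assigns one sampling probability per row, so row 3 is the most likely draw here, but any row with a non-zero weight can still come up, as row 2 did above. One way to check the weights empirically is to draw many rows with replacement and compare the frequencies. This is only a sketch; the number of draws and the seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(25).reshape(5, 5))
weights = [0.1, 0.1, 0.1, 0.6, 0.1]

# Draw 10,000 rows with replacement; random_state makes the run repeatable.
draws = df3.sample(n=10_000, replace=True, weights=weights, random_state=0)

# Share of draws that landed on each row label.
freq = draws.index.value_counts(normalize=True).sort_index()
print(freq)
```

The observed share for row 3 should sit close to 0.6, and the others close to 0.1.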


In [183]:
df.sample(5)

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,name,zipcode,diabetes,family_history
396,12.3,15.9,78.83,463.7,0.0808,1,56,3,Meredith Hill,53011,1,0
39,14.95,17.57,96.85,,0.1167,0,22,6,Mary Travis,53005,0,0
458,,14.76,84.74,551.7,0.07355,1,36,0,Sonya Mckee,53006,0,0
554,11.94,20.76,77.87,441.0,0.08605,1,36,1,Karen Young,53010,0,0
218,12.72,17.67,,501.3,0.07896,1,50,7,Carla Guerra,53014,0,1


## to_datetime

In [217]:
data = pd.date_range('2019-08-01','2020-08-01').values.astype(str)
data = pd.DataFrame(data)
data.columns = ['data_ref']
# Use len(data) rather than a hard-coded 367 so the size always matches the date range
data['infected'] = np.random.randint(0,125,size=len(data))

In [218]:
data

Unnamed: 0,data_ref,infected
0,2019-08-01T00:00:00.000000000,86
1,2019-08-02T00:00:00.000000000,89
2,2019-08-03T00:00:00.000000000,100
3,2019-08-04T00:00:00.000000000,124
4,2019-08-05T00:00:00.000000000,22
...,...,...
362,2020-07-28T00:00:00.000000000,26
363,2020-07-29T00:00:00.000000000,116
364,2020-07-30T00:00:00.000000000,44
365,2020-07-31T00:00:00.000000000,62


In [219]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   data_ref  367 non-null    object
 1   infected  367 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.9+ KB


In [220]:
data['data_ref'] = pd.to_datetime(data['data_ref'])

In [221]:
data.head()

Unnamed: 0,data_ref,infected
0,2019-08-01,86
1,2019-08-02,89
2,2019-08-03,100
3,2019-08-04,124
4,2019-08-05,22


In [256]:
data['dia_da_semana'] = data['data_ref'].dt.dayofweek
data['semana_do_ano'] = data['data_ref'].dt.isocalendar().week
data['ano'] = data['data_ref'].dt.year

In [248]:
data['data_ref'].min()

Timestamp('2019-08-01 00:00:00')

In [250]:
data['data_ref'].max()

Timestamp('2020-08-01 00:00:00')
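
Once the column has a datetime dtype, the `.dt` accessor and `resample` make time-based aggregation straightforward. The sketch below rebuilds a similar frame with deterministic placeholder counts (an assumption, replacing the random values used above) and aggregates by weekday and by month.

```python
import numpy as np
import pandas as pd

# Same date range as above, stored as strings first.
data = pd.DataFrame({"data_ref": pd.date_range("2019-08-01", "2020-08-01").astype(str)})
data["infected"] = np.arange(len(data))  # deterministic placeholder counts
data["data_ref"] = pd.to_datetime(data["data_ref"])

# Mean count per weekday; dayofweek runs from Monday=0 to Sunday=6.
by_weekday = data.groupby(data["data_ref"].dt.dayofweek)["infected"].mean()

# Total count per calendar month ("MS" labels each bin by the month start).
monthly = data.set_index("data_ref")["infected"].resample("MS").sum()

print(by_weekday)
print(monthly.head())
```

`resample` needs a datetime index (or an `on=` column), which is why the conversion with `pd.to_datetime` has to come first.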

## Continuing

In [273]:
df.describe(percentiles=[.25, .5, .75])

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis,age,pregnancies,zipcode,diabetes,family_history
count,554.0,546.0,548.0,544.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.108717,19.295952,92.367792,652.674449,0.09636,0.627417,39.216169,3.769772,53008.44464,0.302285,0.479789
std,3.509903,4.289885,24.351585,349.628827,0.014064,0.483918,10.527664,2.578637,4.550674,0.459652,0.500031
min,6.981,9.71,47.92,143.5,0.05263,0.0,15.0,0.0,53001.0,0.0,0.0
25%,11.6825,16.21,75.2675,419.525,0.08637,0.0,30.0,1.0,53005.0,0.0,0.0
50%,13.375,18.89,86.735,548.75,0.09587,1.0,41.0,4.0,53008.0,0.0,0.0
75%,15.78,21.8075,105.25,784.15,0.1053,1.0,48.0,6.0,53012.0,1.0,1.0
max,28.11,39.28,188.5,2501.0,0.1634,1.0,58.0,8.0,53016.0,1.0,1.0
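
In the `count` row above, the first four columns report 554, 546, 548 and 544 values against 569 rows, which indicates missing entries in those columns. A quick way to count them directly is `isna().sum()`. The sketch below uses an invented frame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Invented values with a few gaps, reusing column names from the dataset.
toy = pd.DataFrame({
    "mean_radius": [10.0, np.nan, 12.0, 14.0],
    "mean_area": [300.0, 400.0, np.nan, np.nan],
    "age": [30, 40, 50, 60],
})

missing = toy.isna().sum()  # missing values per column
non_null = toy.count()      # the same number describe() reports as "count"

print(missing)
```

For every column, `missing + non_null` equals the number of rows.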


In [285]:
df_vazio = pd.DataFrame([])

In [286]:
df_vazio['age_0'] = df.query("diagnosis == 0").age.describe()
df_vazio['age_1'] = df.query("diagnosis == 1").age.describe()

In [287]:
df_vazio

Unnamed: 0,age_0,age_1
count,212.0,357.0
mean,28.051887,45.845938
std,5.631003,6.303764
min,15.0,28.0
25%,24.0,42.0
50%,28.0,46.0
75%,32.0,51.0
max,41.0,58.0
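
The same side-by-side summary can also be produced without building an empty frame first, by grouping on `diagnosis` and transposing the result. A minimal sketch on invented data (the frame `toy`, its values and the name `age_summary` are assumptions for illustration):

```python
import pandas as pd

# Invented values, reusing column names from the dataset.
toy = pd.DataFrame({
    "diagnosis": [0, 0, 0, 1, 1, 1],
    "age": [20, 25, 30, 40, 45, 50],
})

# describe() per group gives one row per diagnosis; .T puts the groups in columns.
age_summary = toy.groupby("diagnosis")["age"].describe().T
age_summary.columns = [f"age_{d}" for d in age_summary.columns]

print(age_summary)
```

This scales to any number of groups, since the column names are generated from the group keys.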


In [290]:
ls = [1,2,2,2,2,2,2,2,2,40]

In [291]:
np.mean(ls)

5.7

In [292]:
np.median(ls)

2.0

In [277]:
df_vazio['area'] = df['mean_area'].describe()

In [278]:
df_vazio

Unnamed: 0,area
count,544.0
mean,652.674449
std,349.628827
min,143.5
25%,419.525
50%,548.75
75%,784.15
max,2501.0


In [268]:
df.shape

(569, 12)