# Descriptive data analysis (EDA)

1. Build a descriptive analysis that extracts knowledge from the variables and presents the insights that can be drawn from them.

2. Show a way to **graphically** select the most and least important variables for each problem, how they relate to one another, and why.

3. For each problem, describe which other techniques could have been applied and why you did not choose them.

4. Use the data: eda_receitas_data.zip

### Imports

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns
import missingno as msno
from icecream import ic

In [2]:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

In [3]:
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

## Data Inspection
Examining the structure of the data.

In [4]:
df = pd.read_json("receitas.json")

In [5]:
init_samples = df.shape[0]

print(df.shape)
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

(20130, 11)
The dataset has 20130 rows and 11 columns.


In [6]:
df.head()

Unnamed: 0,directions,fat,date,categories,calories,desc,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,2006-09-01 04:00:00+00:00,"[Sandwich, Bean, Fruit, Tomato, turkey, Vegeta...",426.0,,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
1,[Combine first 9 ingredients in heavy medium s...,23.0,2004-08-20 04:00:00+00:00,"[Food Processor, Onion, Pork, Bake, Bastille D...",403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,2004-08-20 04:00:00+00:00,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
3,[Heat oil in heavy large skillet over medium-h...,,2009-03-27 04:00:00+00:00,"[Fish, Olive, Tomato, Sauté, Low Fat, Low Cal,...",,The Sicilian-style tomato sauce has tons of Me...,,5.0,Mahi-Mahi in Tomato Olive Sauce,"[2 tablespoons extra-virgin olive oil, 1 cup c...",
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,2004-08-20 04:00:00+00:00,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0


In [7]:
df.index

RangeIndex(start=0, stop=20130, step=1)

In [8]:
df.columns

Index(['directions', 'fat', 'date', 'categories', 'calories', 'desc',
       'protein', 'rating', 'title', 'ingredients', 'sodium'],
      dtype='object')

In [9]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20130 entries, 0 to 20129
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype              
---  ------       --------------  -----              
 0   directions   20111 non-null  object             
 1   fat          15908 non-null  float64            
 2   date         20111 non-null  datetime64[ns, UTC]
 3   categories   20111 non-null  object             
 4   calories     15976 non-null  float64            
 5   desc         13495 non-null  object             
 6   protein      15929 non-null  float64            
 7   rating       20100 non-null  float64            
 8   title        20111 non-null  object             
 9   ingredients  20111 non-null  object             
 10  sodium       15974 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(5), object(5)
memory usage: 1.7+ MB


#### Insights 1:
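- $1$ _datetime_ field | $5$ _float_ fields | $5$ _object_ fields (strings and lists)

- several null observations need handling: `desc` and the nutrition columns (`fat`, `calories`, `protein`, `sodium`) have the most missing values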

In [10]:
# intended to avoid dtype errors (casting each column to its own dtype leaves the data unchanged)
for column in df:
    df[column] = df[column].astype(df[column].dtype)

In [11]:
numeric_cols = df.select_dtypes(include=["number"]).columns
print(numeric_cols)

non_numeric_cols = df.select_dtypes(exclude=["number"]).columns
print(non_numeric_cols)

Index(['fat', 'calories', 'protein', 'rating', 'sodium'], dtype='object')
Index(['directions', 'date', 'categories', 'desc', 'title', 'ingredients'], dtype='object')


##### numeric values

In [12]:
df[numeric_cols].describe()

Unnamed: 0,fat,calories,protein,rating,sodium
count,15908.0,15976.0,15929.0,20100.0,15974.0
mean,346.0975,6307.857,99.946199,3.71306,6211.474
std,20431.02,358585.1,3835.616663,1.343144,332890.3
min,0.0,0.0,0.0,0.0,0.0
25%,7.0,198.0,3.0,3.75,80.0
50%,17.0,331.0,8.0,4.375,294.0
75%,33.0,586.0,27.0,4.375,711.0
max,1722763.0,30111220.0,236489.0,5.0,27675110.0


##### non-numeric values

In [None]:
ic(df[non_numeric_cols].sample())

Unnamed: 0,directions,date,categories,desc,title,ingredients
9567,"[Steam parsnips over medium heat until tender,...",2004-08-20 04:00:00+00:00,"[Mustard, Side, Sauté, Steam, Christmas, Thank...",An excellent complement to roast pork.,Maple-Glazed Parsnips with Popped Mustard Seeds,"[2 pounds parsnips, peeled, quartered lengthwi..."


In [None]:
print('from ' + str(df.date.min()) + ' to ' + str(df.date.max()))

from 1996-09-01 20:47:00+00:00 to 2016-12-13 13:00:00+00:00


In [20]:
# estimate of the number of tags per recipe
cats = df.categories.values.tolist()
y = []

In [19]:
for x in cats[50:60]:
    y.append(len(x))
    print(len(x))
np.mean(y)

17
9
16
13
9
15
12
17
9
11


12.298357664233576

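A _list comprehension_ could have been used, but it was discarded because of type errors when building the `np.array`s, likely related to missing `categories` entries, which are `float` NaN values without a `len()`. Note that the mean shown above (about $12.30$) does not match the ten lengths printed, whose mean is $12.8$; `y` presumably still held values from earlier runs of the cell.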
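A NaN-tolerant way to count tags per recipe can be sketched as follows. This is a minimal example on a hypothetical toy frame (`toy` is not part of the dataset); it relies on `Series.str.len()`, which returns the length of list elements and propagates missing entries as NaN.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for df["categories"];
# NaN marks a recipe without any tags
toy = pd.DataFrame(
    {"categories": [["Soup/Stew", "Dairy"], ["Fish"], np.nan, ["Bake", "Side", "Pasta"]]}
)

# .str.len() works on list elements and yields NaN for missing entries
n_tags = toy["categories"].str.len()

# Series.mean() skips NaN by default
mean_tags = n_tags.mean()
print(n_tags.tolist(), mean_tags)
```

Applied to the real frame, the same two calls would avoid both the manual loop and the accumulation of values in a shared list across cell runs.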

### Feature dictionary

| feature | description |
| --- | :------ |
| directions | step-by-step preparation instructions |
| fat | amount of fat in the recipe |
| date | date on which the recipe was added |
| __categories__ | _tags_ that categorize the type of recipe |
| calories | amount of calories in the recipe |
| desc | details and feedback about the recipe |
| protein | amount of protein in the recipe |
| __rating__ | recipe score, from 0 to 5 |
| title | name given to the recipe |
| ingredients | ingredients used in the recipe |
| sodium | amount of sodium in the recipe |

#### Insights 2:
- **categories** and **rating** are candidate *target* (label) variables

- `categories` has to be broken down into individual tags before it can be analyzed

- splitting `date` into separate date and time components enables further analyses

- the number of categories per recipe and the tags themselves become interesting features
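The feature ideas listed above could be sketched as follows. This is a minimal illustration on a hypothetical two-row frame (`toy` and the derived column names `year`, `month`, `hour` and `n_tags` are assumptions, not part of the dataset), not the notebook's final pipeline.

```python
import pandas as pd

# hypothetical toy rows mimicking the `date` and `categories` columns
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2006-09-01 04:00:00+00:00", "2004-08-20 04:00:00+00:00"], utc=True
        ),
        "categories": [["Bean", "Tomato"], ["Bake", "Tomato", "Pork"]],
    }
)

# split the timestamp into calendar and clock components
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["hour"] = toy["date"].dt.hour

# number of tags per recipe
toy["n_tags"] = toy["categories"].str.len()

# one row per (recipe, tag), then one indicator column per tag
tags_long = toy["categories"].explode()
tag_dummies = pd.get_dummies(tags_long).groupby(level=0).max()

print(tag_dummies)
```

Grouping the dummies by the original index collapses the exploded rows back to one row per recipe, so `tag_dummies` can be joined to the source frame on its index.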

### Data quality

##### missing values

In [None]:
# number of missing values per column
df.isnull().sum().sort_values(ascending=False)

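The count of $19$ repeats across several columns (`directions`, `date`, `categories`, `title` and `ingredients` each have $20130 - 20111 = 19$ nulls), which is plausibly not a coincidence.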

In [None]:
msno.matrix(df, labels=True)

A line runs across the matrix, indicating completely empty rows.

In [None]:
ic(df[df.title.isnull()].shape[0])
df[df.title.isnull()]

Indeed, there are $19$ rows without data.
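One way to confirm that such rows are empty in every column, rather than only in `title`, is sketched below on a hypothetical toy frame (`toy` is illustrative, not the dataset).

```python
import numpy as np
import pandas as pd

# hypothetical toy frame: row 1 is completely empty, row 2 only lacks `fat`
toy = pd.DataFrame(
    {
        "title": ["Soup", np.nan, "Cake"],
        "fat": [7.0, np.nan, np.nan],
        "rating": [2.5, np.nan, 4.375],
    }
)

# True for rows in which every column is missing
fully_empty = toy.isnull().all(axis=1)
n_empty = int(fully_empty.sum())

# keep only rows that have at least one value
cleaned = toy.dropna(how="all")
print(n_empty, cleaned.shape)
```

If the same check on the real frame also returned $19$, the rows with a null `title` could be dropped with `dropna(how="all")` without losing any information.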

In [None]:
df.isnull().mean().sort_values(ascending=False)

##### duplicate values

In [None]:
ic(df.duplicated(['title','date','desc','rating']).value_counts())

In [None]:
df[df.duplicated(['title','date','desc','rating'], keep='first')]

> Assuming this dataset comes from a _reviews_ site, some repeated values are acceptable under the business model, but not all of them. For that reason, near-identical _reviews_ will be removed later on.
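The removal itself could look like the sketch below, which reuses the key columns from the `duplicated` calls above on a hypothetical toy frame (`toy` is illustrative). Restricting the key to scalar columns is an assumption that also sidesteps the list-valued columns (`directions`, `categories`, `ingredients`), whose values are not hashable.

```python
import pandas as pd

# hypothetical toy reviews; rows 0 and 1 share title, date, desc and rating
toy = pd.DataFrame(
    {
        "title": ["Soup", "Soup", "Soup", "Cake"],
        "date": pd.to_datetime(
            ["2004-08-20", "2004-08-20", "2009-03-27", "2004-08-20"]
        ),
        "desc": ["Quick.", "Quick.", "Quick.", None],
        "rating": [3.75, 3.75, 3.75, 5.0],
    }
)

key = ["title", "date", "desc", "rating"]

# keep the first occurrence of each key combination
deduped = toy.drop_duplicates(subset=key, keep="first")
n_removed = len(toy) - len(deduped)
print(n_removed)
```

Row 2 survives because its `date` differs, which matches the idea that the same recipe reviewed on another day is a legitimate repetition.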

##### unique values

In [None]:
unique_values = df[numeric_cols].nunique().sort_values(ascending=False)

In [None]:
ic(unique_values)
unique_values.plot.bar(
    logy=True, figsize=(12, 6), title="Unique values per numeric feature"
);

In [None]:
# cats = df['categories'].tolist()
# cats
# cats_flat = [i for cat in cats for i in cat]

## Data Wrangling
Cleaning, Feature Engineering and Preprocessing

##### To-Do:

1. Remove duplicate values

2. Remove the most incomplete rows
    1. rows with no values
    2. rows missing many features


3. Fill in missing values with an _Imputer_

4. Generate new features
    1. tags (one-hot)
    2. number of tags


5. Remove outliers
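Steps 3 and 5 could be sketched as follows. This is a minimal illustration under stated assumptions: it uses plain pandas median imputation instead of a dedicated imputer class, a $1.5 \times IQR$ rule for outliers, and a hypothetical toy column (`toy` and its values are illustrative only).

```python
import numpy as np
import pandas as pd

# hypothetical toy nutrition values with one missing entry and one extreme value
toy = pd.DataFrame({"calories": [198.0, 331.0, np.nan, 586.0, 426.0, 3.0e7]})

# step 3 (imputation): fill missing values with the column median,
# which is far less sensitive to extreme values than the mean
median = toy["calories"].median()
toy["calories"] = toy["calories"].fillna(median)

# step 5 (outliers): keep values inside the 1.5 * IQR fences
q1, q3 = toy["calories"].quantile([0.25, 0.75])
iqr = q3 - q1
inside = toy["calories"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = toy[inside]
print(len(toy), len(filtered))
```

The median is used here because the `describe()` output shows means far above the 75th percentile for the nutrition columns, so a mean-based fill would be dominated by a handful of extreme rows.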

In [None]:
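# skip missing entries: NaN is a float, which is what raised
# "TypeError: object of type 'float' has no len()"
cats_count = []
for x in cats:
    if isinstance(x, list):
        cats_count.append(len(x))
np.mean(cats_count)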