# Use pandas to inspect the tables of the Rossmann dataset

We're starting from Sanyam Bhutani's Kaggle dataset, which packages the fastai v3 lesson 6 data (the Rossmann competition files plus all the additional tables).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Read all tables
PATH = '../input/fastai-v3-lesson-6-rossmandataset/'
table_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']
tables = [pd.read_csv(PATH+f'{fname}.csv', low_memory=False) for fname in table_names]
train, store, store_states, state_names, googletrend, weather, test = tables

## Create a diagram showing possible table relationships


In [None]:
# Grab the list of fields for each table
dicts = [{'src':n, 'field':df.columns} for n,df in zip(table_names,tables)]
pd.DataFrame(dicts)

**STEP1**: We've constructed a DataFrame (AKA table) from a list of dictionaries. This is useful because:
+ we can ensure that all the data are "aligned" (i.e. same number of columns and same data types)
+ each column is aligned using the dictionary key, so we can mix different types of values and deal with missing values
+ we get column names for our final *table*

In [None]:
# Sample: mixing missing values
pd.DataFrame(
    [{'src':n, 'field':df.columns} for n,df in zip(table_names,tables)] 
  + [{'src':'What???','note':'this should be removed'}]
)

**STEP2**: flatten the `field` column, transforming a row containing a list of columns into an equivalent number of rows.

NOTE: another great suggestion from Simon: the `explode` (aka `flatten`) operator is not even mentioned in [Wes McKinney](https://www.oreilly.com/library/view/python-for-data/9781449323592/)'s official pandas book ;-)
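
To see what `explode` does in isolation, here is a minimal sketch on a made-up two-row table. The `toy` frame and its values are hypothetical and not part of the Rossmann data.

```python
import pandas as pd

# Hypothetical toy data: each row holds a table name and the list of its columns
toy = pd.DataFrame([
    {'src': 'a', 'field': ['Store', 'Date']},
    {'src': 'b', 'field': ['Store']},
])

# explode() emits one row per list element and repeats the other columns
flat = toy.explode('field')
print(flat)
```

The original row labels are repeated as well, so the index of `flat` contains duplicates unless you reset it.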

In [None]:
df = pd.DataFrame(dicts).explode('field')
df

### **STEP3**: create a dummy column inline. This *functional / immutable* approach is very compact and ensures that no changes are applied to the original data. 

You can obtain the same result with the code below, but you'll need to write it on multiple lines, and each statement modifies `df` in place (AKA the *mutable way*).
```
df['isPresent']=True
df['myOtherField']=df.src + ":" + df.field
```
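
To check the immutability claim on something small, the sketch below uses a made-up two-row frame (`toy` is hypothetical). Note that a callable passed to `assign` receives the whole DataFrame, not a single row.

```python
import pandas as pd

# Hypothetical toy frame standing in for the exploded table
toy = pd.DataFrame({'src': ['a', 'b'], 'field': ['Store', 'Date']})

# assign() returns a new DataFrame; the lambda is called with the full frame
out = toy.assign(isPresent=True, myOtherField=lambda d: d.src + ":" + d.field)

print(list(toy.columns))  # the original frame keeps its two columns
print(list(out.columns))  # the result carries the two extra columns
```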

In [None]:
df.assign(isPresent=True, myOtherField=lambda r: r.src + ":" + r.field).head()

### **STEP4:** group-by and unstack. This operation is similar to Excel's "pivot table".

**IMPORTANT**: to avoid hierarchical indices on the columns we've selected the `['isPresent']` column. This means that we're actually *unstacking* a Series, not a DataFrame.
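
The difference between unstacking a Series and unstacking a DataFrame can be sketched on a tiny made-up table (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical (src, field) pairs with the placeholder column already added
toy = pd.DataFrame({
    'src':   ['a', 'a', 'b'],
    'field': ['Store', 'Date', 'Store'],
}).assign(isPresent=True)

grouped = toy.groupby(['field', 'src'])

# Selecting one column first gives a Series, so unstack() yields flat columns
flat = grouped['isPresent'].count().unstack()

# Without the selection we unstack a DataFrame and get two column levels
nested = grouped.count().unstack()

print(flat.columns.nlevels, nested.columns.nlevels)
```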

In [None]:
df.assign(isPresent=True).groupby(['field','src'])['isPresent'].count().unstack().head(10)

### STEP5: filter and sort `fields` so that the most frequent come first.
We're going to skip the fields whose count is less than two because we cannot make any "join" on them.
We do this in two steps:
1. Compute the field order that we want: in this case we group by field name and sort descending by count; we also filter out all the fields with a count less than 2.
2. Reindex the resulting DataFrame with the index found in the previous step.

NOTE: to make the field-order computation clearer, we've used the *immutable* rename operator.
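
The two steps can be sketched end to end on a made-up table (all names and values in `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical (field, src) pairs: 'Id' appears in a single table
toy = pd.DataFrame({
    'src':   ['a', 'b', 'c', 'a', 'b', 'c'],
    'field': ['Store', 'Store', 'Store', 'Date', 'Date', 'Id'],
})

# Step 1: fields seen in at least two tables, most frequent first
order = (toy.groupby('field')['src'].count()
            .loc[lambda s: s >= 2]
            .sort_values(ascending=False)
            .index)

# Step 2: reindex keeps only the rows named in `order`, in that order
table = (toy.assign(isPresent=True)
            .groupby(['field', 'src'])['isPresent'].count()
            .unstack())
print(table.reindex(order))
```

Because `reindex` only returns the labels it is given, it filters and sorts in a single call.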

In [None]:
# Compute field order
field_by_count = (df.groupby(['field']).count() # Group by field and take the count
                    .rename(columns={'src':'src_count'}) # Rename 'src' column to 'src_count'
                    ['src_count'] # Transform DataFrame into a Series
                    .loc[lambda x: x>=2] # Keep only the fields with count >= 2
                    .sort_values(ascending=False) # Sort by count, most frequent first
                 )
field_by_count

**VERY IMPORTANT**: notice that `field_by_count` is **a Series, not a DataFrame**.
I preferred to work with a Series because we're going to focus on a single field (src_count) to apply our business logic (keep if greater than 1 and sort descending).
This *transformation from DataFrame to Series* happens when we select the field `['src_count']`.

**FILTERING**: we've used the `.loc[function]` filter syntax in order to keep the *immutable*, chainable approach. Another option that does the same is: 
```
field_by_count = field_by_count[field_by_count>=2]
```
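
As a quick check that the callable form used in the cell above (`.loc[lambda x: ...]`) and the boolean-mask form select the same rows, here is a sketch on a hand-made Series (the counts are hypothetical):

```python
import pandas as pd

# Hypothetical field counts
counts = pd.Series({'Store': 3, 'Date': 2, 'Id': 1})

chained = counts.loc[lambda s: s >= 2]  # callable form: no intermediate variable needed
masked = counts[counts >= 2]            # mask form: refers to `counts` twice

print(chained.equals(masked))
```

The callable form is what makes the filter usable in the middle of a method chain, where the intermediate Series has no name yet.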

**NOTE:** In the previous cell I've shown a "general" approach that involves group-by and computations.
If we're only interested in counting values, pandas offers the Series method `.value_counts`, which does the same thing in a more compact form.
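
A small sketch comparing the two approaches on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical (field, src) pairs
toy = pd.DataFrame({
    'src':   ['a', 'b', 'c', 'a', 'b', 'c'],
    'field': ['Store', 'Store', 'Store', 'Date', 'Date', 'Id'],
})

by_group = toy.groupby('field')['src'].count()  # general group-by route
by_vc = toy['field'].value_counts()             # compact route

# Same numbers; only the row order and index naming may differ
print(by_vc.sort_index().to_dict() == by_group.sort_index().to_dict())
```

By default `value_counts` already returns the counts in descending order, while the group-by result is ordered by field name.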

In [None]:
# Compute field order (short form)
field_by_count = (df['field'].value_counts() # Short form: a Series with the count of each field
                    .loc[lambda x: x>=2] # Keep only the fields with count >= 2
                    .sort_values(ascending=False) # Sort by count, most frequent first
                 )
field_by_count

In [None]:
# Reorder columns
df.assign(isPresent=True).groupby(['field','src'])['isPresent'].count().unstack().reindex(field_by_count.index).head(15)

### STEP6: transform into boolean
To make the table clearer we can convert the result of count to boolean (the count is always 1 when we group by `src` and `field`, because no table has duplicate field names).
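
A minimal sketch of this conversion on made-up data (the `toy` frame is hypothetical). It also shows where the missing values in the pivoted table come from: a (field, src) pair that never occurs has no row to convert, so it only appears as NaN after `unstack`.

```python
import pandas as pd

# Hypothetical (src, field) pairs: table 'b' has no 'Date' column
toy = pd.DataFrame({
    'src':   ['a', 'a', 'b'],
    'field': ['Store', 'Date', 'Store'],
}).assign(isPresent=True)

counts = toy.groupby(['field', 'src'])['isPresent'].count()
flags = counts.astype(bool)  # every existing (field, src) pair becomes True
wide = flags.unstack()       # absent pairs surface as NaN

print(wide)
```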

In [None]:
# transform into boolean
df.assign(isPresent=True).groupby(['field','src'])['isPresent'].count().astype(bool).unstack().reindex(field_by_count.index)

### STEP7: color missing values in gray to make the table more readable

NOTE: the `style` step should always be the last one, because its return value is no longer a DataFrame (it's a `Styler`).

In [None]:
# Final command with explanation
(
    df.assign(isPresent=True) # Add "isPresent" placeholder
      .groupby(['field','src']).count() # Groupby
      ['isPresent'] # Convert to series taking only this field
      .astype(bool) # convert type
      .unstack() # "Pivot" with respect to 'src'
      .reindex(field_by_count.index) # Reorder and filter
      .style.highlight_null('gray') # Change style of output (color passed positionally: the keyword is `null_color` in older pandas and `color` in newer ones)
)

In [None]:
# Store the intermediate relationship table
rels = df.assign(isPresent=True).groupby(['field','src'])['isPresent'].count().astype(bool).unstack().reindex(field_by_count.index)
rels.style.highlight_null('gray')  # color passed positionally so that it works across pandas versions

In [None]:
# Restacking again shows us the non null relations :-)
rels.stack()

In [None]:
# Extract unique relationships
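# Name the index explicitly: depending on the pandas version it is otherwise either unnamed ('level_0' after reset_index) or already called 'field'
rels_edges = rels.rename_axis('field').stack().reset_index().groupby('field')['src'].apply(tuple).unique()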
rels_edges

In [None]:
# Remember the fields that made the join
tt = (rels.stack() # Stack back to one row per (field, src) pair
          .reset_index() # Turn the index levels back into columns
          .rename(columns={'level_0':'field'}) # Name the field column if reset_index left it as 'level_0'
          .groupby('field') # Group by field
          ['src'] # Select 'src', giving a grouped Series
          .apply(tuple) # Collect the tables of each field into a tuple, which is hashable and so can be grouped on
          .reset_index() # Reset the index again so that 'field' and 'src' are both columns
          .groupby('src')['field'] # Regroup by the tuple of tables
          .apply(list) # Collect the fields shared by each group of tables into a list
          .apply(lambda x: x[0] + ('...' if len(x)>1 else '')) # Label: the first field, plus an ellipsis if there are more
     )
tt

In [None]:
rels_edges = list(tt.index)
rels_desc = tt.values
pd.DataFrame({'desc':rels_desc,'edges':rels_edges}) # Just to display!

**IMPORTANT**: we apply `tuple` and not `list` because tuples are immutable and hashable, so we can use the `unique` operator to filter out duplicates.
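
The role of hashability can be sketched on a small made-up table (the `pairs` frame and its table names are hypothetical):

```python
import pandas as pd

# Hypothetical (field, src) pairs: 'Store' and 'Id' are shared by the same two tables
pairs = pd.DataFrame({
    'field': ['Store', 'Store', 'Date', 'Date', 'Id', 'Id'],
    'src':   ['store', 'train', 'test', 'train', 'store', 'train'],
})

# One tuple of table names per field
per_field = pairs.groupby('field')['src'].apply(tuple)

# unique() can compare the tuples because they are hashable
unique_sets = per_field.unique()
print(unique_sets)

# A list, by contrast, cannot be hashed at all
try:
    hash(['store', 'train'])
    list_hashable = True
except TypeError:
    list_hashable = False
print(list_hashable)
```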

In [None]:
import itertools
import numpy as np

rels_edge_pairs = [list(itertools.combinations(re, 2)) for re in rels_edges]
rels_edge_pairs

In [None]:
edges_df = pd.DataFrame({'desc':rels_desc,'pairs':rels_edge_pairs}).explode('pairs')
edges_df

In [None]:
import graphviz


dot = graphviz.Digraph(comment='Tables relationships')

for c in rels.columns: dot.node(c)
for i,r in edges_df.iterrows():
    if (r.desc=='file'): continue # Skip joining by file!
    if (r.pairs == ('train','test')) or (r.pairs == ('test','train')): continue # Skip train / test rels
    dot.edge(r.pairs[0],r.pairs[1],label=r.desc)
        
display(dot)

**NOTE**: this is a very naive way of showing possible relationships, based only on field names. This is the reason why I've filtered out the "file" field.
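
The name-matching idea can be reduced to a few lines of plain Python. This is only a sketch under the same naive assumption (equal column names suggest a possible join), and the column sets below are made up for illustration:

```python
import itertools

# Hypothetical column sets for three tables
columns = {
    'train': {'Store', 'Date', 'Sales'},
    'store': {'Store', 'StoreType'},
    'test':  {'Store', 'Date', 'Id'},
}

# An edge links two tables that share at least one column name
edges = {
    (a, b): sorted(columns[a] & columns[b])
    for a, b in itertools.combinations(sorted(columns), 2)
    if columns[a] & columns[b]
}

for pair, shared in edges.items():
    print(pair, shared)
```

A shared name does not guarantee a shared meaning, so every edge found this way still needs a manual check before it is used as a join key.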

## External libraries: missingno for an overview of missing values
Thanks to [Simon Grest](https://www.kaggle.com/simongrest) for this great suggestion: this library lets you quickly figure out which columns have missing values.
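
As a numeric counterpart to the visual overview, a pandas-only sketch can report the share of missing values per column. The `toy` frame below is hypothetical: the column names mimic the store table, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
toy = pd.DataFrame({
    'Store': [1, 2, 3, 4],
    'CompetitionDistance': [570.0, np.nan, 620.0, np.nan],
    'Promo2SinceWeek': [np.nan, np.nan, np.nan, 14.0],
})

# isna() marks missing cells; the mean of each boolean column is its missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```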

In [None]:
import missingno as msno
for i,n in enumerate(table_names):
    df = tables[i]
    msno.matrix(df.sample(min(len(df),250)));
    plt.title(n);

In [None]:
googletrend

## Plot with pandas

In [None]:
# Plot by week
googletrend['date'] = googletrend['week'].apply(lambda x: x.split(' - ')[0])
googletrend['date'] = pd.to_datetime(googletrend['date'])
(googletrend.set_index('date')
           .groupby('file') # Group by "file/store"
           .trend # Divide into one Series per group
           .plot(figsize=(20,10), title='Trend by week') # Plot the group content
);

In [None]:
# Plot by month
(
    googletrend.set_index('date') # Set a datetime index: this is needed to time-resample
               .groupby('file') # Group values by file
               ['trend'] # Keep only the numeric column: the text 'week' column cannot be averaged
               .resample('M').mean() # Resample by month and take the mean for each period
               .swaplevel() # Swap index from (file,date) -> (date,file)
               .unstack() # Pivot and put "file" into columns
               .plot(figsize=(20,10)) # Plot each column as a separate line
)