# Business Case -- Cloudwalk 

## Data Analyst

### Candidate: Rafael Feltrin

Let's start by importing functions from the modules I wrote to perform the challenge's Parts I and II, which are respectively *etl_script.py* and *eda_script.py*.

In [None]:
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

from scripts.etl_script import (read_csv, 
                                write_database_dates,
                                write_associates_and_branches_cols,
                                create_geodf)

# Part I: ETL

We start by opening the .csv file and creating a few new columns as we see fit -- some are used here, and others could be nice to have in future analyses.

We begin with the *read_csv* function, which takes the file path and already applies one simple change: it pads the root CNPJ and the full CNPJ with leading zeroes in case they are missing.

I always do that because this kind of data tends to end up in the hands of less tech-savvy areas such as sales ops and marketing. It is very common for them to match these CNPJ columns against manually collected spreadsheets or CRM outputs, which commonly keep the leading zeroes.

In [None]:
df = read_csv(path=r'C:\Users\rafaf\PycharmProjects\data_analyst_case_cloudwalk\data\data_case_2024_03.csv',
              sep=',',
              zfill_cols=['document_number', 'cnpj_basico'])

display(df.head(10))

We can see it works.
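
The *read_csv* function itself lives in *etl_script.py* and is not shown here, so the following is only a minimal sketch of the zero-filling idea, not the actual implementation. It assumes `document_number` holds the full 14-digit CNPJ and `cnpj_basico` the 8-digit root; `pad_cnpj` and the sample values are hypothetical and purely illustrative.

```python
# Hypothetical helper, not the author's read_csv: left-pads digit strings
# so that CNPJs which lost their leading zeroes (e.g. after being parsed
# as integers) regain their fixed width.
CNPJ_WIDTH = 14       # full CNPJ (assumed to be document_number)
CNPJ_ROOT_WIDTH = 8   # root CNPJ (assumed to be cnpj_basico)


def pad_cnpj(value, width):
    """Return value as a string left-padded with zeroes to `width` characters."""
    return str(value).strip().zfill(width)


# Illustrative rows only; these are not taken from the dataset.
rows = [
    {"document_number": 191000110, "cnpj_basico": 191},
    {"document_number": "12345678000195", "cnpj_basico": "12345678"},
]

padded = [
    {
        "document_number": pad_cnpj(row["document_number"], CNPJ_WIDTH),
        "cnpj_basico": pad_cnpj(row["cnpj_basico"], CNPJ_ROOT_WIDTH),
    }
    for row in rows
]

for row in padded:
    print(row)
```

On a DataFrame, the same effect can be obtained by reading the columns as strings and calling pandas' `Series.str.zfill` with the desired width.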

Next, we will convert the string-formatted dates to a database-friendly date format just in case we need them later. The *write_database_dates* function does that in a very modular fashion -- if we went back to the Roman calendar or added a new month, we would only need to update the inputs!

Back to the real world, we input the 12 months in Portuguese and run it.

In [None]:
portuguese_months = ['janeiro', 'fevereiro', 'março', 'abril',
                     'maio', 'junho', 'julho', 'agosto', 
                     'setembro', 'outubro', 'novembro', 'dezembro']

month_numbers = [f"{i:02d}" for i in range(1, len(portuguese_months) + 1)]

months = {name: num for name, num in zip(portuguese_months, month_numbers)}

df = write_database_dates(df=df, 
                          date_col='opening_date', 
                          months=months)

display(df['formatted_opening_date'].head(10))
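
To illustrate how a month mapping like this can drive the conversion, here is a rough sketch. The raw format of `opening_date` is not shown in this notebook, so the pattern below assumes strings shaped like "5 de março de 2019"; that shape, the `to_iso_date` helper and the ISO output format are all assumptions, not the behavior of *write_database_dates*.

```python
import re

# Assumed input shape: "<day> de <month name> de <year>".
# The real opening_date format may differ.
DATE_PATTERN = re.compile(r"^\s*(\d{1,2}) de (\w+) de (\d{4})\s*$")

portuguese_months = ['janeiro', 'fevereiro', 'março', 'abril',
                     'maio', 'junho', 'julho', 'agosto',
                     'setembro', 'outubro', 'novembro', 'dezembro']

# Same mapping as in the cell above: month name -> zero-padded month number.
months = {name: f"{i:02d}" for i, name in enumerate(portuguese_months, start=1)}


def to_iso_date(text, months):
    """Convert an assumed long-form Portuguese date to YYYY-MM-DD.

    Returns None when the text does not match the assumed pattern or the
    month name is not present in the mapping.
    """
    match = DATE_PATTERN.match(text.lower())
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = months.get(month_name)
    if month is None:
        return None
    return f"{year}-{month}-{int(day):02d}"


print(to_iso_date("5 de março de 2019", months))
print(to_iso_date("31 de Dezembro de 2021", months))
print(to_iso_date("not a date", months))
```

Because the month names arrive as an argument, supporting another language or calendar only means passing a different mapping.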

Another interesting column is the one with the number of branches and associates. It is a JSON-like structure, and we parse it with the *write_associates_and_branches_cols* function to add two new integer columns with the number of associates and branches.

In [None]:
df = write_associates_and_branches_cols(df=df,
                                        json_col='total_branches_and_associates')

display(df[['total_associates', 'total_branches']].head(10))
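
As a sketch of the parsing step, assuming each cell is a small dictionary-shaped string: the key names `total_associates` and `total_branches` inside the raw value are hypothetical (the notebook only shows them as output column names), and `parse_counts` is not the author's function.

```python
import ast
import json


def parse_counts(raw):
    """Parse a JSON-like string into (associates, branches) as integers.

    Tries strict JSON first, then falls back to a Python literal, which
    covers single-quoted dictionaries that are JSON-like but not valid JSON.
    The key names used here are assumed, not confirmed by the dataset.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = ast.literal_eval(raw)
    return int(data["total_associates"]), int(data["total_branches"])


# Illustrative inputs only.
print(parse_counts('{"total_associates": 3, "total_branches": 2}'))
print(parse_counts("{'total_associates': '1', 'total_branches': 0}"))
```

Returning a tuple makes it straightforward to assign both new columns in a single pass over the source column.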

Finally, the cherry on top. Using the *geobr* package we can get official geospatial data from IPEA, which we will need when plotting map charts in Part II.

The *create_geodf* function reads all IBGE city codes in the dataset and gets their geometry in a GeoDataFrame format, which we will later join with a pivot table of company data to plot the maps and show business insights.   

In [None]:
gdf = create_geodf(df=df,
                   city_code_col='city_code',
                   year=2022)

display(gdf.head(10))

# Part II: EDA