# Report of Exploratory Analysis for the Customer Service team at Profeco

In [1]:
# Import libraries
import pandas

In [2]:
# Load the data file, reading every column as text
df = pandas.read_csv('data/all_data.csv', dtype=str)

## How many commercial chains are monitored, and therefore, included in this database?

In [3]:
print('There are {} commercial chains monitored.'.format(df['cadenaComercial'].nunique()), end="\n\n")

There are 705 commercial chains monitored.



## What are the top 10 monitored products by State?

In [4]:
for s in df['estado'].unique():
    print('The top 10 products for {} are:'.format(s))
    # Count rows per product within the state and keep the 10 most frequent
    counts = df.loc[df['estado'] == s].groupby(['producto'])['producto'].count()
    top10 = counts.sort_values(ascending=False).head(10)
    print(top10.to_string(header=False), end="\n\n")

The top 10 products for DISTRITO FEDERAL are:
REFRESCO                   287463
FUD                        207569
LECHE ULTRAPASTEURIZADA    175640
DETERGENTE P/ROPA          173452
YOGHURT                    136720
CERVEZA                    136686
MAYONESA                   131103
CHILES EN LATA             130598
JABON DE TOCADOR           129889
SHAMPOO                    125603

The top 10 products for MÉXICO are:
REFRESCO                   194939
FUD                        149141
DETERGENTE P/ROPA          132862
LECHE ULTRAPASTEURIZADA    116522
JABON DE TOCADOR            97330
YOGHURT                     94852
MAYONESA                    94286
CHILES EN LATA              92539
SHAMPOO                     92307
CERVEZA                     91747

The top 10 products for COAHUILA DE ZARAGOZA are:
FUD                        28876
REFRESCO                   27164
DETERGENTE P/ROPA          26706
SHAMPOO                    21401
LECHE ULTRAPASTEURIZADA    21047
JABON DE TOCADOR     

REFRESCO                27770
FUD                     17776
PANTALLAS               17383
DETERGENTE P/ROPA       15510
LAVADORAS               13242
SHAMPOO                 12844
JABON DE TOCADOR        12799
CHILES EN LATA          11537
PLANCHAS                11301
COMPONENTES DE AUDIO    11036

The top 10 products for CAMPECHE are:
FUD                        12960
REFRESCO                   11333
PANTALLAS                  10449
LAVADORAS                   9729
DETERGENTE P/ROPA           8307
LECHE ULTRAPASTEURIZADA     8027
REFRIGERADORES              7572
SHAMPOO                     7560
JABON DE TOCADOR            7543
PLANCHAS                    7139

The top 10 products for CHIAPAS are:
REFRESCO                   14452
FUD                        10417
DETERGENTE P/ROPA           8831
PANTALLAS                   8395
JABON DE TOCADOR            7841
CHILES EN LATA              7707
SHAMPOO                     7601
LECHE ULTRAPASTEURIZADA     7278
DESODORANTE                 7

## Which is the commercial chain with the highest number of monitored products?

In [5]:
# Count distinct products per chain and keep the chain with the most
cch = df.groupby(['cadenaComercial'])['producto'].nunique().sort_values(ascending=False).head(1)
print('The commercial chain with the highest number of monitored products is {} with {} products'.format(cch.index.values[0], cch.values[0]))

The commercial chain with the highest number of monitored products is SORIANA with 1059 products


## Use the data to find an interesting fact.

In [6]:
# Chain that carries the largest number of distinct brands
ccmb = df.groupby(['cadenaComercial'])['marca'].nunique().sort_values(ascending=False).head(1)
print('The commercial chain offering the most brands is {} with {} brands.'.format(ccmb.index.values[0], ccmb.values[0]))

# Brand that covers the largest number of distinct products
bmp = df.groupby(['marca'])['producto'].nunique().sort_values(ascending=False).head(1)
print('The brand with most products is {} with {} products.'.format(bmp.index.values[0], bmp.values[0]))

# Total number of distinct brands in the dataset
print('We are monitoring {} brands.'.format(df['marca'].nunique()))

The commercial chain offering the most brands is WAL-MART with 1338 brands.
The brand with most products is S/M with 565 products.
We are monitoring 2079 brands.


## What are the lessons learned from this exercise?
The main lesson of this project was learning to use the Pandas library. I had not worked with Pandas before, and at first its functions were somewhat hard to understand, but once I understood them the library became easier and easier to use.

I was pleasantly surprised by what Pandas can do, and it would certainly have made my life easier in the past when processing large volumes of data.

## Can you identify other ways to approach this problem? Explain.
A less efficient alternative would have been plain Python: the built-in "open" function and the file "read" method to load the data, plus some regular expressions to parse it.
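
As a rough illustration of that plain-Python route, the sketch below counts the distinct chains and tallies products per state without Pandas. It is a minimal example under some assumptions: it uses the standard `csv` module instead of hand-written regular expressions, the helper name `summarize` is hypothetical, and the sample rows are made up purely so the snippet runs on its own. Only the column names come from the analysis above.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize(rows):
    """Return (number of distinct chains, per-state product counters)."""
    chains = set()
    per_state = defaultdict(Counter)
    for row in rows:
        chains.add(row["cadenaComercial"])
        per_state[row["estado"]][row["producto"]] += 1
    return len(chains), per_state


# Made-up sample standing in for the real CSV file (illustrative values only).
SAMPLE = """producto,cadenaComercial,estado
REFRESCO,CHAIN A,STATE X
REFRESCO,CHAIN B,STATE X
FUD,CHAIN A,STATE X
FUD,CHAIN A,STATE Y
"""

n_chains, per_state = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print("There are {} commercial chains monitored.".format(n_chains))
for state, counter in per_state.items():
    # most_common(10) plays the role of sort_values(...).head(10)
    print(state, counter.most_common(10))
```

With the real file, the `io.StringIO(SAMPLE)` argument would be replaced by a file object returned by `open`. Every question then needs its own loop and bookkeeping, which is what makes this route more laborious than the Pandas one-liners.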

Another possible approach would have been to load all the data into a relational database such as MySQL or Microsoft SQL Server and run SQL queries against it until the expected results were obtained.
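
The database route could look roughly like the sketch below. To keep it self-contained it uses an in-memory SQLite database from the Python standard library rather than MySQL or SQL Server, so it is a stand-in for the idea, not a tested setup for those servers. The table name `prices` and the sample rows are assumptions made for illustration; only the column names come from the dataset.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a MySQL / SQL Server instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (producto TEXT, marca TEXT, cadenaComercial TEXT, estado TEXT)"
)

# Made-up rows (illustrative values only); the real data would be bulk-loaded from the CSV.
rows = [
    ("REFRESCO", "BRAND 1", "CHAIN A", "STATE X"),
    ("FUD", "BRAND 2", "CHAIN A", "STATE X"),
    ("REFRESCO", "BRAND 1", "CHAIN B", "STATE Y"),
]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?, ?)", rows)

# How many commercial chains are monitored?
n_chains = conn.execute(
    "SELECT COUNT(DISTINCT cadenaComercial) FROM prices"
).fetchone()[0]

# Which chain has the highest number of distinct monitored products?
top_chain = conn.execute(
    """
    SELECT cadenaComercial, COUNT(DISTINCT producto) AS n_products
    FROM prices
    GROUP BY cadenaComercial
    ORDER BY n_products DESC
    LIMIT 1
    """
).fetchone()

print(n_chains, top_chain)
conn.close()
```

Here `COUNT(DISTINCT ...)` plays the role of `nunique()`, and `GROUP BY` with `ORDER BY ... LIMIT 1` mirrors the `groupby(...).sort_values(...).head(1)` chain used earlier in the report.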

That said, for this kind of exploratory analysis, Pandas was by far the most convenient option.