# Creating the Contabot Database

To build a bot integrated with an enterprise database, we will use mock data from [this repository](https://github.com/sinjoysaha/sales-analysis) (credit to [@sinjoysaha](https://github.com/sinjoysaha)) to simulate a sales history.

## Steps
1. Clone the repository
2. Read the CSV files
3. Load each file into a pandas DataFrame
    a. Remove empty rows
4. Merge the DataFrames
5. Store the merged rows in a single DataFrame
6. Save a new CSV file with all the rows
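
The steps above can be condensed into one function. The sketch below is only an illustration: the name `build_unified_csv` and the two tiny in-memory files are hypothetical, not part of the notebook or of the real sales data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_unified_csv(dataset_dir, output_path):
    """Read every CSV in dataset_dir, merge them, drop incomplete rows, save the result."""
    # Collect the CSV paths before writing anything, in a stable order
    csv_paths = sorted(Path(dataset_dir).glob("*.csv"))
    frames = [pd.read_csv(path) for path in csv_paths]
    # Stack the rows of all files into a single DataFrame
    merged = pd.concat(frames, ignore_index=True)
    # Remove rows that contain at least one missing value
    cleaned = merged.dropna()
    cleaned.to_csv(output_path, sep=";", index=False)
    return cleaned


# Made-up demonstration data (hypothetical, not the real sales files)
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "jan.csv").write_text("Order ID,Product\n1,Cable\n,\n")
    (tmp_path / "feb.csv").write_text("Order ID,Product\n2,Monitor\n")
    result = build_unified_csv(tmp_path, tmp_path / "output.csv")
    saved_text = (tmp_path / "output.csv").read_text()
```

Here `result` holds only the two complete rows, and `saved_text` is the semicolon-separated file content.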

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import os

## Settings

In [2]:
CURRENT_PATH = os.getcwd()
REPOSITORY_DATASET_PATH = os.path.join(CURRENT_PATH, 'sales-analysis', 'dataset')
OUTPUT_PATH = os.path.join(CURRENT_PATH, 'output.csv')

## Importing the dataset

In [3]:
!git clone https://github.com/sinjoysaha/sales-analysis

Cloning into 'sales-analysis'...


In [4]:
# List every file in the dataset folder of the cloned repository
csv_datasets = os.listdir(REPOSITORY_DATASET_PATH)
datasets = []
for dataset in csv_datasets:
    # Read each CSV file into its own DataFrame
    csv_path = os.path.join(REPOSITORY_DATASET_PATH, dataset)
    dataframe = pd.read_csv(csv_path, delimiter=',')
    datasets.append(dataframe)
print(f"{len(datasets)} datasets imported.")

12 datasets imported.
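
`os.listdir` returns every entry in the folder, so the loop above relies on the folder containing only CSV files. A more defensive variant might filter by extension first. This is a minimal sketch: the helper name `read_csv_folder` and the temporary files are hypothetical.

```python
import os
import tempfile

import pandas as pd


def read_csv_folder(folder):
    """Return one DataFrame per .csv file in folder, in sorted filename order."""
    names = sorted(
        name for name in os.listdir(folder) if name.lower().endswith(".csv")
    )
    return [pd.read_csv(os.path.join(folder, name)) for name in names]


# Made-up folder with one CSV file and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.csv"), "w") as handle:
        handle.write("x,y\n1,2\n")
    with open(os.path.join(tmp, "README.md"), "w") as handle:
        handle.write("not a dataset\n")
    frames = read_csv_folder(tmp)

print(len(frames))  # 1
```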


## Preprocessing

Merging the datasets

In [5]:
sales_dataset = pd.concat(datasets, axis='index')
sales_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 186850 entries, 0 to 11685
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Order ID          186305 non-null  object
 1   Product           186305 non-null  object
 2   Quantity Ordered  186305 non-null  object
 3   Price Each        186305 non-null  object
 4   Order Date        186305 non-null  object
 5   Purchase Address  186305 non-null  object
dtypes: object(6)
memory usage: 10.0+ MB
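
The index line above (186850 entries, labelled 0 to 11685) shows that `pd.concat` kept each file's original row labels, so labels repeat across files. If unique labels are wanted, `ignore_index=True` is one option. The two small frames below are made-up data used only to show the difference.

```python
import pandas as pd

# Two made-up frames standing in for two monthly files
first = pd.DataFrame({"Order ID": ["1", "2"]})
second = pd.DataFrame({"Order ID": ["3", "4"]})

# Default behaviour: each frame keeps its own labels, so they repeat
kept = pd.concat([first, second], axis="index")
# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], axis="index", ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```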


Removing empty rows

In [7]:
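# Note: reindex() with no arguments leaves the index unchanged
sales_dataset = sales_dataset.reindex()
# Drop every row that contains a missing value
sales_dataset = sales_dataset.dropna()
sales_dataset.info()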

<class 'pandas.core.frame.DataFrame'>
Index: 186305 entries, 0 to 11685
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Order ID          186305 non-null  object
 1   Product           186305 non-null  object
 2   Quantity Ordered  186305 non-null  object
 3   Price Each        186305 non-null  object
 4   Order Date        186305 non-null  object
 5   Purchase Address  186305 non-null  object
dtypes: object(6)
memory usage: 9.9+ MB
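
The output above shows that `dropna()` removed 545 rows (186850 down to 186305), while the index labels still run from 0 to 11685. If a contiguous index is wanted after dropping rows, `reset_index(drop=True)` is a common choice. The sketch below uses a made-up three-row frame.

```python
import pandas as pd

# Made-up frame with one empty row in the middle
frame = pd.DataFrame({"Product": ["Cable", None, "Monitor"]})

# dropna() removes the empty row but keeps the surviving labels
cleaned = frame.dropna()
# reindex() without arguments returns the same labels
unchanged = cleaned.reindex()
# reset_index(drop=True) renumbers the rows from zero
renumbered = cleaned.reset_index(drop=True)

print(cleaned.index.tolist())     # [0, 2]
print(unchanged.index.tolist())   # [0, 2]
print(renumbered.index.tolist())  # [0, 1]
```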


## Saving the unified dataset

In [9]:
sales_dataset.to_csv(OUTPUT_PATH, sep=';', index=False)
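
Because the file is written with `sep=';'`, anything that reads it back needs the same separator. The round trip below is a sketch on made-up data. The address value is hypothetical, and it assumes that values in `Purchase Address` may contain commas, which a semicolon separator leaves unquoted.

```python
import io

import pandas as pd

# Made-up row; the address is a hypothetical value containing commas
frame = pd.DataFrame(
    {"Order ID": ["1"], "Purchase Address": ["1 Main St, Boston, MA"]}
)

# Write to an in-memory buffer with the same options as the notebook
buffer = io.StringIO()
frame.to_csv(buffer, sep=";", index=False)
text = buffer.getvalue()

# Read it back with the matching separator, keeping every column as text
restored = pd.read_csv(io.StringIO(text), sep=";", dtype=str)

print(restored["Purchase Address"].tolist())  # ['1 Main St, Boston, MA']
```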