# Import And Export Analysis Through Sea and Air

## Table of Contents

* [Introduction](#Introduction)
* [Data Preprocessing](#Data-Preprocessing)
* [Data Processing](#Data-Processing)
* [Model Selection](#Model-Selection)
* [Clustering](#Clustering)
* [Policies](#Policies)
* [Conclusion](#Conclusion)


## Introduction

This report summarizes the results of a project for the Knowledge Engineering Lab (CS307).

The task was to download a novel Indian government dataset and apply data mining and analysis methods to it to obtain a tangible, useful result.

The data comes from [data.gov.in](data.gov.in) and covers the **Import and Export of Commodities through Sea and Air in the State of Tamil Nadu**. It consists of four datasets: Import Commodities through Sea, Import Commodities through Air, Export Commodities through Sea, and Export Commodities through Air.

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
```

## Data Preprocessing

The following steps were taken to clean the data:

* Null values in the dataset were marked with the `@` character. These were replaced by zero, which makes more intuitive sense for a trade value.
* Rows holding a Total entry were removed, since a total conveys nothing beyond the sum of its parts.
* Empty rows were removed. (They were marked with a 'False' entry in the Commodity name.)
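
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the project notebooks: the function name `clean_trade_table`, the column names `Commodity` and `Value`, and the toy rows are all assumptions, and the sketch assumes that Total rows can be recognized from the commodity name.

```python
import pandas as pd


def clean_trade_table(df, commodity_col="Commodity", value_cols=("Value",)):
    """Apply the three cleaning steps to one raw table (hypothetical column names)."""
    cleaned = df.copy()
    names = cleaned[commodity_col].astype(str).str.strip()

    # Empty rows show up as a 'False' commodity name; Total rows are assumed
    # to carry the word "total" in the commodity name.
    is_empty = names.str.lower().eq("false")
    is_total = names.str.contains("total", case=False)
    cleaned = cleaned[~(is_empty | is_total)].reset_index(drop=True)

    # '@' stands in for a null value; treat it as zero and make the column numeric.
    for col in value_cols:
        as_text = cleaned[col].astype(str).str.strip().replace("@", "0")
        cleaned[col] = pd.to_numeric(as_text)
    return cleaned


# Toy input, made up for illustration only.
raw_example = pd.DataFrame({
    "Commodity": ["Tea", "Total", False, "Rice"],
    "Value": ["12.5", "100", "@", "@"],
})
cleaned_example = clean_trade_table(raw_example)
```

Working on a copy keeps the raw table available for sanity checks after cleaning.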

## Data Processing

The following steps came next:

* The datasets were pivoted with the name of the country as the index, the commodity group as the columns, and the value of trade (Rs. in crores) as the values.
* Duplicate entries caused by variations in country names were merged by summing them. For example, UAE also appeared as United Arab Emirates; the two values were summed and UAE was kept as the entry name.
* The latitude and longitude of each country were appended to the datasets so that the countries could be plotted easily.
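
A rough sketch of these three steps with pandas is shown below. The column names (`Country`, `Commodity`, `Value`), the `aliases` mapping, the trade values, and the approximate coordinates are all illustrative assumptions rather than values taken from the project data.

```python
import pandas as pd

# Toy long-format table, made up for illustration only.
long_df = pd.DataFrame({
    "Country": ["UAE", "United Arab Emirates", "Japan", "Japan"],
    "Commodity": ["Tea", "Tea", "Tea", "Rice"],
    "Value": [2.0, 3.0, 1.5, 4.0],
})

# Hypothetical mapping from name variants to the chosen entry name.
aliases = {"United Arab Emirates": "UAE"}

# Approximate country coordinates, for illustration only.
coords = pd.DataFrame(
    {"Latitude": [23.42, 36.20], "Longitude": [53.85, 138.25]},
    index=["UAE", "Japan"],
)


def pivot_trade(df, aliases):
    """Unify country names, then pivot to a country x commodity table."""
    unified = df.assign(Country=df["Country"].replace(aliases))
    # aggfunc="sum" merges the duplicate country entries by summing them.
    return unified.pivot_table(
        index="Country", columns="Commodity", values="Value",
        aggfunc="sum", fill_value=0,
    )


wide = pivot_trade(long_df, aliases)
wide_with_coords = wide.join(coords)  # append latitude and longitude by country
```

Renaming before pivoting lets the pivot itself do the summing, so no separate merge pass over the duplicates is needed.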

[Notebook1](Data_Cleaning.ipynb)


[Notebook2](Getting_countries_latitude_and_longitude.ipynb)

## Model Selection

* Each dataset was fitted with several clustering models, and a common metric (the Calinski-Harabasz score) was used to find the best model.
* K-Means performed best on all of the datasets except Import Commodities through Air, for which Affinity Propagation was used.
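
The comparison can be sketched with scikit-learn as below, assuming that is the library behind the model names above. The synthetic data from `make_blobs`, the number of clusters, and the candidate list are assumptions for illustration; the project's own candidates and settings are in the linked notebook.

```python
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in for one pivoted trade table (assumption).
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

# Hypothetical candidate models and settings.
candidates = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "AffinityPropagation": AffinityPropagation(random_state=0),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    n_found = len(set(labels))
    # The score is only defined when there are between 2 and n_samples - 1
    # clusters, so models that collapse or fail to converge are skipped.
    if 2 <= n_found < len(X):
        scores[name] = calinski_harabasz_score(X, labels)

# A higher Calinski-Harabasz score indicates denser, better separated clusters.
best_model = max(scores, key=scores.get)
```

Scoring every model with the same metric on the same data is what makes the comparison across models meaningful.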

[Notebook](Clustering_Model_Selection.ipynb)

## Clustering 

* Each dataset was then clustered with its respective model.
* The cluster number of each row was appended to a copy of the dataset, which was then saved.
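
A minimal sketch of this step follows. The helper `add_cluster_column`, the `Cluster` column name, the toy table, and the output file name are hypothetical, and the sketch assumes only the trade value columns are passed to the model.

```python
import pandas as pd
from sklearn.cluster import KMeans


def add_cluster_column(wide, model, column="Cluster"):
    """Fit the model and return a copy of the table with a cluster label column."""
    labelled = wide.copy()  # the original table is left untouched
    labelled[column] = model.fit_predict(wide.to_numpy())
    return labelled


# Toy country x commodity table, made up for illustration only.
toy = pd.DataFrame(
    {"Tea": [5.0, 4.8, 0.1, 0.2], "Rice": [0.1, 0.0, 4.0, 4.2]},
    index=["A", "B", "C", "D"],
)
clustered = add_cluster_column(toy, KMeans(n_clusters=2, n_init=10, random_state=0))

# Saving would then be a single call (hypothetical file name):
# clustered.to_csv("export_sea_clusters.csv")
```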

[Notebook 1](Clustering.ipynb)


[Notebook 2](post_analysis/Clusters_data_sets.ipynb)

## Policies 

* An initial draft of the policies was created by looking at the disparities in the volume of goods transferred.
* The policies were then filtered, and the most important ones (according to volume of trade) were selected.

[Notebook](post_analysis/Policies.ipynb)

## Conclusion

The dataset on the Import and Export of Commodities through Sea and Air, taken from [data.gov.in](data.gov.in), was sanity checked, filtered, transformed, clustered, and finally turned into policies for use by policy makers.

