In [9]:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql://postgres:postgres@localhost/postgres')

def load_data(tablename):
    return pd.read_sql_query('select * from ' + tablename, con=engine)

### Step 1: Load the Dataset
First, we need to load the dataset. We’ll use the load_data function defined above.

In [10]:
mktDF = load_data('dim_mkt')
mktDF.head()

Unnamed: 0,tag,short,long,display_order,parent_tag,hier_num,hier_name,hier_level_num,hier_level_name
0,M000000000000100249400000000000001005858,Total Coverage,GB Total Coverage,1,,1,1002494GB AB TOTAL COVERAGE BY CONSUMER VIEW,1,GB_M_1_AB_TOTALCOVERAGE
1,M000000000000100249400000000000001006867,Convenience,GB Total Coverage Convenience,2,M000000000000100249400000000000001005858,1,1002494GB AB TOTAL COVERAGE BY CONSUMER VIEW,2,GB_M_3_CONSUMERGROUP
2,M000000000000100249400000000000001006868,Convenience Stores,GB Total Coverage Convenience Convenience Stores,3,M000000000000100249400000000000001006867,1,1002494GB AB TOTAL COVERAGE BY CONSUMER VIEW,3,Consumer View
3,M000000000000100249400000000000001006652,High Street,GB Total Coverage High Street,4,M000000000000100249400000000000001005858,1,1002494GB AB TOTAL COVERAGE BY CONSUMER VIEW,2,GB_M_3_CONSUMERGROUP
4,M000000000000100249400000000000001006653,Large Stores,GB Total Coverage High Street Large Stores,5,M000000000000100249400000000000001006652,1,1002494GB AB TOTAL COVERAGE BY CONSUMER VIEW,3,Consumer View


### Step 2: Drop Duplicates

Duplicate rows can skew your analysis and lead to incorrect results. In our case the data comes from a PostgreSQL table with a primary key, so there are no duplicates here. If the data were taken from another platform, however, this step could be important.


In [11]:
mktDF = mktDF.drop_duplicates()
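
To illustrate when this step matters, here is a minimal sketch on a small invented frame (not the dim_mkt data; only the column names are borrowed). It counts fully identical rows first, then deduplicates on a key column only, which is useful when a source has no primary key to rely on.

```python
import pandas as pd

# Invented sample rows; only the column names follow dim_mkt.
sample = pd.DataFrame({
    "tag": ["M1", "M2", "M2", "M3"],
    "short": ["Total", "Convenience", "Convenience", "High Street"],
})

# Count rows that are exact copies of an earlier row.
n_exact = int(sample.duplicated().sum())

# Deduplicate on the key column only, keeping the first occurrence.
deduped = sample.drop_duplicates(subset=["tag"], keep="first").reset_index(drop=True)

print(n_exact)
print(deduped)
```

Checking the count before dropping makes it visible whether the step changed anything at all.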

### Step 3: Drop Unwanted Columns

Dropping unwanted columns from the DataFrame can improve performance. We use the inplace option to avoid reassigning the DataFrame.

- It is highly recommended to do this in the select statement when reading the table from the database, so that unneeded columns are never loaded.
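
Selecting only the needed columns in the query could look like the sketch below. The helper build_select is hypothetical and not part of the original notebook; it only builds the SQL string and rejects names that are not plain identifiers, since table and column names cannot be passed as bound parameters.

```python
import re

# Hypothetical helper for illustration; not part of the original notebook.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_select(tablename, columns):
    """Return a SELECT statement that lists only the requested columns."""
    if not columns:
        raise ValueError("at least one column is required")
    for name in [tablename, *columns]:
        # Accept plain identifiers only, to avoid SQL injection via names.
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    column_list = ", ".join(f'"{c}"' for c in columns)
    return f'SELECT {column_list} FROM "{tablename}"'

query = build_select("dim_mkt", ["tag", "short", "long", "parent_tag"])
print(query)
```

The resulting string could then be passed to pd.read_sql_query(query, con=engine) in place of the select * query.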

In [12]:
mktDF.isnull().sum(axis = 0)

tag                0
short              0
long               0
display_order      0
parent_tag         8
hier_num           0
hier_name          0
hier_level_num     0
hier_level_name    0
dtype: int64

In [13]:
unwanted_column_headers = ['display_order']

mktDF.drop(columns=unwanted_column_headers, inplace=True)

### Let's get some insight from the columns

In [14]:
mktDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   tag              38 non-null     object
 1   short            38 non-null     object
 2   long             38 non-null     object
 3   parent_tag       30 non-null     object
 4   hier_num         38 non-null     int64 
 5   hier_name        38 non-null     object
 6   hier_level_num   38 non-null     int64 
 7   hier_level_name  38 non-null     object
dtypes: int64(2), object(6)
memory usage: 2.5+ KB


In [15]:
mktDF.describe()

Unnamed: 0,hier_num,hier_level_num
count,38.0,38.0
mean,3.921053,2.289474
std,2.045187,0.80229
min,1.0,1.0
25%,2.25,2.0
50%,4.0,2.5
75%,5.0,3.0
max,8.0,3.0


# Market Hierarchy

We have some chain stores here with a hierarchical relationship. To understand how they relate to each other, let's print them as a tree.

In [16]:
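def trace(parenttag=None):
    if parenttag is None:
        # Start from the root rows (level 1), then descend into each one.
        for index, row in mktDF[mktDF['hier_level_num'] == 1].iterrows():
            print(row['long'])
            trace(parenttag=row['tag'])
    else:
        # Print the direct children of parenttag, indented by hier_level_num tabs,
        # then recurse into each child.
        for index, row in mktDF[mktDF['parent_tag'] == parenttag].iterrows():
            print(row['hier_level_num'] * '\t', row['long'])
            trace(parenttag=row['tag'])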
        
trace()

GB Total Coverage
		 GB Total Coverage Convenience
			 GB Total Coverage Convenience Convenience Stores
		 GB Total Coverage High Street
			 GB Total Coverage High Street Large Stores
			 GB Total Coverage High Street Small Stores
		 GB Total Coverage Out of Town
			 GB Total Coverage Out of Town Megastores
			 GB Total Coverage Out of Town Superstores
GB Symbols
GB Wilkinson GSD
GB Grocery Multiples
		 GB Grocery Multiples England & Wales
			 GB Grocery Multiples England & Wales Central
			 GB Grocery Multiples England & Wales East of England
			 GB Grocery Multiples England & Wales Lancs and English Border
			 GB Grocery Multiples England & Wales London
			 GB Grocery Multiples England & Wales North East
			 GB Grocery Multiples England & Wales South & South East
			 GB Grocery Multiples England & Wales South West
			 GB Grocery Multiples England & Wales Wales & West
			 GB Grocery Multiples England & Wales Yorkshire
		 GB Grocery Multiples Scotland
			 GB Grocery Multiples Scotland 
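
The recursive trace filters the DataFrame once per node. As an alternative sketch, the children of every node can be grouped once and the tree walked from that lookup. The rows below are invented and only shaped like dim_mkt records; the function render_tree is hypothetical, and it assumes that a missing parent is represented as None.

```python
from collections import defaultdict

# Invented rows shaped like dim_mkt records; the tags are placeholders.
rows = [
    {"tag": "A", "parent_tag": None, "long": "GB Total Coverage"},
    {"tag": "B", "parent_tag": "A", "long": "GB Total Coverage Convenience"},
    {"tag": "C", "parent_tag": "B",
     "long": "GB Total Coverage Convenience Convenience Stores"},
    {"tag": "D", "parent_tag": None, "long": "GB Symbols"},
]

def render_tree(rows):
    """Return the hierarchy as a list of lines, one tab per depth level."""
    # Group rows by their parent once, so each lookup is a dict access.
    children = defaultdict(list)
    for row in rows:
        children[row["parent_tag"]].append(row)

    lines = []

    def walk(parent, depth):
        for row in children.get(parent, []):
            lines.append("\t" * depth + row["long"])
            walk(row["tag"], depth + 1)

    walk(None, 0)  # roots are the rows whose parent is None
    return lines

for line in render_tree(rows):
    print(line)
```

Here the indentation follows the depth in the tree rather than hier_level_num, so the output is indented one tab less than the trace output above.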

# Outliers and Missing Values

There are no outliers here, and the only null values are in the parent_tag column, as the check below shows.


In [17]:
mktDF.isnull().sum(axis = 0)

tag                0
short              0
long               0
parent_tag         8
hier_num           0
hier_name          0
hier_level_num     0
hier_level_name    0
dtype: int64

 
As we can see, there are just 8 null values in this DataFrame, and all of them are in the parent_tag column.

We are looking at a hierarchy of chain stores, which we can picture as a set of trees. The root of a tree has no parent, so a null value in this column is meaningful: it marks a root rather than missing data.
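
One way to confirm this reading is to check that the rows with a null parent_tag are exactly the level 1 rows, and that every non-null parent_tag points at an existing tag. The sketch below runs on an invented frame with the same column names; the same expressions could be applied to mktDF, assuming level 1 always denotes a root.

```python
import pandas as pd

# Invented sample with the same column names as dim_mkt.
sample = pd.DataFrame({
    "tag": ["A", "B", "C", "D"],
    "parent_tag": [None, "A", "B", None],
    "hier_level_num": [1, 2, 3, 1],
})

is_root = sample["hier_level_num"] == 1
has_no_parent = sample["parent_tag"].isna()

# True when nulls occur on root rows and nowhere else.
nulls_are_roots = bool((is_root == has_no_parent).all())

# True when every non-null parent_tag refers to a tag that exists.
parents = sample.loc[~has_no_parent, "parent_tag"]
parents_exist = bool(parents.isin(sample["tag"]).all())

print(nulls_are_roots, parents_exist)
```

If either check came out False on real data, the nulls (or some parent references) would need a closer look before being treated as structural.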

# Based on the concept, some steps are not needed here

### Bivariate Analysis
### Feature Engineering
### Feature Scaling
### Feature Normalization
### Dimensionality reduction