![](https://www.vhv.rs/dpng/f/21-217176_street-png.png)

# Data Description

This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represent a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering. In addition to anonymized feature values, you are provided with metadata about the features in features.csv.

In the training set, train.csv, you are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons. These variables are not included in the test set. Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.
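To make the roles of weight, resp and action concrete, here is a minimal sketch on made-up rows. It assumes that a trade's contribution is the product weight * resp * action and that a natural training label is `resp > 0`; both are my assumptions for illustration, and the column `trade_return` is a hypothetical name, not something from the competition files.

```python
import pandas as pd

# Toy rows standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    "weight": [0.0, 1.5, 2.0, 0.5],
    "resp":   [0.01, -0.02, 0.03, 0.004],
})

# Assumption: take the trade (action = 1) when its realised return is positive
toy["action"] = (toy["resp"] > 0).astype(int)

# Assumption: a trade contributes weight * resp when it is taken, else nothing
toy["trade_return"] = toy["weight"] * toy["resp"] * toy["action"]

print(toy)
print("total return of taken trades:", toy["trade_return"].sum())
```

Note that the first row has a positive resp but weight 0, so it adds nothing to the total, which matches the statement that zero-weight trades do not contribute to scoring.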

This is a code competition that relies on a time-series API to ensure models do not peek forward in time. To use the API, follow the instructions on the Evaluation page. When you submit your notebook, it will be rerun on an unseen test:

   - During the model training phase of the competition, this unseen test set is comprised of approximately 1 million rows of historical data.
   - During the live forecasting phase, the test set will use periodically updated live market data.

Note that during the second (forecasting) phase of the competition, the notebook time limits will scale with the number of trades presented in the test set. Refer to the Code Requirements for details.

# About me
   - I am a Data Scientist at one of the leading IT firms in Pakistan. I was recently engaged with Radix Trading LLC, a firm that, like Jane Street, works in high-frequency algorithmic trading. They wanted to expand their quantitative team, and I was part of their core expansion team in Pakistan. You can learn more about me from my [LinkedIn profile](https://www.linkedin.com/in/hamxahbhatti/).
   - Looking at this dataset, I can see a lot of similarities with the work I have done.
   - I have seen different EDAs on Kaggle and they have done a pretty good job. Since not much information is provided about this dataset, I have tried to explain the concepts that I think one should know before moving to the data modeling phase.
    

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.gridspec as gridspec
from collections import defaultdict

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("../input/jane-street-market-prediction/train.csv")

* The dataset has 138 columns. Looking at the first 5 rows, the data appears to have some missing values as well.

In [None]:
df.head()
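Rather than judging missing values from the first five rows, one option is to compute the share of missing entries per column. The sketch below uses a small made-up frame (`toy` and its column names are hypothetical); on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real training data
toy = pd.DataFrame({
    "feature_a": [0.1, np.nan, 0.3, np.nan],
    "feature_b": [1.0, 2.0, 3.0, 4.0],
    "feature_c": [np.nan, 0.5, 0.7, 0.9],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```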

In [None]:
meta_data = pd.read_csv('../input/jane-street-market-prediction/features.csv')

In [None]:
df.ts_id.unique()

 - Based on my past experience as a quantitative researcher, intraday trading patterns in high-frequency algorithmic trading are far more closely linked than interday patterns. That is why I have decided to look at only one day of data for my exploratory data analysis.
 - Since not enough information is given about these features, I am assuming that ts_id represents the packet number of a commodity from a stock exchange that was used while creating this data.
 - These packets can be of any type. The main types of packets that we usually receive from a stock exchange for different commodities are Trade, Delete, Update, and Reduce messages about changes in the limit order book.
 

In [None]:
df.sort_values(by= ['date','ts_id'],inplace=True)

* After sorting the data by date, I look only at the first day of data.

In [None]:
sample_df = df.query('date == 0')

* This is what the summary statistics of the first day's data look like.

In [None]:
sample_df.describe()

Looking at the data for one day, it seems that only specific types of messages/packets from the order book were used, which is why the record count for one day is very low compared to what I am used to dealing with.

- The data has a lot of missing values. I am filling these data points with the column mean because of the zero/mean reversion strategy that I think has been used here.

In [None]:

sample_df = sample_df.apply(lambda x: x.fillna(x.mean()),axis=0)

* Here I am plotting the histogram of all the features that are present in this dataset.

In [None]:

fig, axes = plt.subplots(nrows=44
                         , ncols=3,figsize=(25,250))
for i, column in enumerate(sample_df.iloc[:,7:].columns):
    sns.distplot(sample_df[column],ax=axes[i//3,i%3])


- Looking at all these charts in one place is tiresome for the eyes, I know :) That is why I will try to group these features into the categories provided by the competition host.
- Looking at the distributions, it seems that most of the features are normally distributed (standardized) and zero/mean reverting. Later I will analyze these features by category so they are a lot easier to understand.

- Traders use multiple approaches, but I am mentioning the two main ones that different firms use when trading:
     - [Momentum-based strategies](https://www.investopedia.com/trading/introduction-to-momentum-trading/)
     - [Zero/mean-reversion strategies](https://decodingmarkets.com/mean-reversion-trading-strategy/)
- Looking at feature 0, it seems to be a signal. (A signal gives an indication about something. For example, in HFT it can be used to track [volume imbalance](https://www.bluevillecapital.com/post/what-is-volume-imbalance-and-how-to-find-it) in the limit order book and tell us whether the bid side or the ask side has more volume.)


- Looking at the histograms of the features, it seems that the dataset has been built from features that are zero reverting.

- If all of these concepts seem alien to you :) you can use the links above to get the idea.
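The claim that most features look standardized and centered on zero can also be checked numerically instead of by eye. The following is a minimal sketch on synthetic columns (the frame `toy` and its two columns are made up): a standardized, symmetric feature should show a mean near 0, a standard deviation near 1 and a skew near 0, while a skewed feature will not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one zero-centred symmetric column, one skewed column
toy = pd.DataFrame({
    "feature_a": rng.normal(0.0, 1.0, 5000),
    "feature_b": rng.exponential(1.0, 5000),
})

# Per-column location, spread and asymmetry
summary = pd.DataFrame({
    "mean": toy.mean(),
    "std": toy.std(),
    "skew": toy.skew(),
})
print(summary.round(3))
```

On the real data the same summary could be built from the feature columns of `sample_df`, which gives one row per feature instead of one histogram per feature.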
  


# Weight
 - Here I analyze the distribution of the weight column. Since not enough information is given about it, I am assuming that weight is based on some risk/volatility concept or on some volume parameter, as volume plays a very important role in trading.

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})

sns.boxplot(x=sample_df["weight"], ax=ax_box)
sns.distplot(sample_df["weight"], ax=ax_hist)


 - From the chart above we can see that a lot of trades on day 0 have zero weight, which is why the median is also close to zero. Note that, as stated in the data description, the competition host included zero-weight trades in the dataset intentionally. Quoting them:
     - "Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation"
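The observation about zero weights can be quantified directly. This is a small sketch on made-up weights (the frame `toy` is hypothetical); on the notebook's data the same expressions would be applied to `sample_df["weight"]`.

```python
import pandas as pd

# Made-up weights with several zeros and one large value
toy = pd.DataFrame({"weight": [0.0, 0.0, 0.3, 1.2, 0.0, 45.0]})

# Share of trades that cannot contribute to the score
zero_share = (toy["weight"] == 0).mean()

# The median is pulled towards zero by the zero-weight trades
median_weight = toy["weight"].median()

# Trades that do count towards the evaluation
scored = toy[toy["weight"] > 0]

print(f"zero-weight share: {zero_share:.2f}, median weight: {median_weight:.2f}")
print(f"scored trades: {len(scored)}")
```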

# Resp 
resp is used as the target variable by most Kagglers. Here I analyze this group of variables to find valuable insights.


In [None]:
resp_df = sample_df.iloc[:,2:7]
fig, axes = plt.subplots(nrows=2
                         , ncols=3,figsize=(20,10))
for i, column in enumerate(resp_df.columns):
    sns.distplot(sample_df[column],ax=axes[i//3,i%3],color='red')

We can see from the distributions that all 5 of these resp variables are centered on zero (zero reverted). Here I am quoting the competition host from one of the discussions about the different resp values.
  - "The only response variable in evaluation is Resp (see the evaluation metric), Resp1 - Resp4 only exists in the train.csv, they are correlated to Resp but not exactly the same (see the data description). They are provided just in case some people want some alternative objective metrics to regularize their model training."
  
So from the quote above we can see that only resp enters the evaluation metric, and Resp1 - Resp4 are extensions of Resp that, as the host mentions, can be used to regularize model training. (I will look at these variables in the modeling phase.)

In [None]:
corr = resp_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 10))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

*   From this correlation chart we can see that resp_4 has the highest positive correlation with resp and resp_1 has the lowest positive correlation with resp.
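Reading the ranking off a heatmap works, but it can also be extracted from the correlation matrix directly. The sketch below uses synthetic columns (made up so that `resp_4` tracks `resp` closely and `resp_1` only loosely); with the notebook's data, `resp_df` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=2000)

# Synthetic stand-ins: resp_4 is a low-noise copy of resp, resp_1 a noisy one
toy = pd.DataFrame({
    "resp_1": base + rng.normal(scale=2.0, size=2000),
    "resp_4": base + rng.normal(scale=0.2, size=2000),
    "resp": base,
})

# Correlation of every other column with resp, strongest first
corr_with_resp = toy.corr()["resp"].drop("resp").sort_values(ascending=False)
print(corr_with_resp)
```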

In [None]:
g = sns.JointGrid(data=sample_df, x="weight", y="resp")
g.plot_joint(sns.scatterplot, s=100, alpha=.5)
g.plot_marginals(sns.distplot, kde=True)

   -  From the chart of day-one weights above we can see that some trades have a weight close to 120. That is why I decided to divide these trades into three categories: weight up to 40, weight above 40 and up to 80, and weight above 80. I will look into these in later sections.

In [None]:
conditions = [
    (sample_df['weight'] <= 40),
    (sample_df['weight'] > 40) & (sample_df['weight'] <= 80),
    (sample_df['weight'] > 80) 
    ]
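values = ['tier_3', 'tier_2', 'tier_1']
# Pass a string default so the result has a single dtype; without it the
# default is the integer 0, which recent NumPy versions reject next to strings
sample_df['weight_tier'] = np.select(conditions, values, default='unknown')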
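The same three tiers can also be produced with `pd.cut`, which keeps the bin edges in one place. This is only an alternative sketch on made-up weights (the frame `toy` is hypothetical), using the 40 and 80 thresholds from the cell above; `pd.cut` uses right-closed intervals by default, so a weight of exactly 40 lands in `tier_3` and exactly 80 in `tier_2`, matching the conditions above.

```python
import numpy as np
import pandas as pd

# Made-up weights that sit on and around the tier boundaries
toy = pd.DataFrame({"weight": [0.0, 40.0, 40.5, 80.0, 119.0]})

# Right-closed bins: (-inf, 40], (40, 80], (80, inf)
toy["weight_tier"] = pd.cut(
    toy["weight"],
    bins=[-np.inf, 40, 80, np.inf],
    labels=["tier_3", "tier_2", "tier_1"],
)
print(toy)
```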


# Meta Data
- As described by the competition host, and as was my initial hypothesis as well, the tags in the metadata represent the concepts (for example volume imbalance, as I mentioned above) that are used to create these features.


# Hypothesis
- This leads me to another hypothesis: features made from the same concept will behave similarly with respect to the resp values and will be correlated with each other as well.

- To check the validity of this hypothesis, I will divide these features into different categories.

In [None]:
# meta_data_t.head()
categories =  defaultdict(list)

for columns  in meta_data.columns[1:]:
        categories[f'{columns}'].append(meta_data.query(f'{columns} == True')['feature'].to_list())


-  Now I will look at the distribution of the columns under each tag to get a better understanding.
- Note that some of the features are made from a combination of different tags, so later I will try to separate those features from the ones that were made using only one tag.
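To separate single-tag from multi-tag features, one can count how many tags are set for each feature. The sketch below assumes the metadata has a `feature` column followed by boolean `tag_*` columns, as features.csv is used above; the frame `toy_meta` and its values are made up.

```python
import pandas as pd

# Made-up stand-in for features.csv: one row per feature, one boolean column per tag
toy_meta = pd.DataFrame({
    "feature": ["feature_0", "feature_1", "feature_2", "feature_3"],
    "tag_0": [True, False, True, False],
    "tag_1": [False, True, True, False],
})

tag_cols = [c for c in toy_meta.columns if c != "feature"]

# Features belonging to each tag, as a flat list per tag
tag_to_features = {
    tag: toy_meta.loc[toy_meta[tag], "feature"].tolist() for tag in tag_cols
}

# Number of tags per feature, and the features that combine several tags
tags_per_feature = toy_meta.set_index("feature")[tag_cols].sum(axis=1)
multi_tag_features = tags_per_feature[tags_per_feature > 1].index.tolist()

print(tag_to_features)
print(multi_tag_features)
```

The length of each list in `tag_to_features` also gives the number of features per tag, which is handy when choosing the subplot grid size for each group.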

# Group Tag 0
* This group includes the Tag 0 features, which are created with the 1st concept. Note that some of these features are created with multiple concepts.

In [None]:
tag_0_df = sample_df[[*categories['tag_0'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=3,figsize=(25,50))
for i, column in enumerate(tag_0_df.columns):
    sns.distplot(tag_0_df[column],ax=axes[i//3,i%3],color='Green')

# Correlation Analysis of Group 0

In [None]:
corr = tag_0_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(45, 20))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

- As we can see, features of the same tag are correlated with each other, apart from a few exceptions such as feature 56, which has no correlation with any other member of this group. Some features are strongly correlated, like features 9, 10 and 19, 20. Some features are negatively correlated; for example, feature 29 has a negative correlation with feature 19 and feature 20. The same holds for multiple features in this group/tag.

- Note that some features in this tag are part of multiple tags. I will look into these in detail later.

# Linear Regression Analysis of Group 0
   - Here I have tried to find the relationship between the different features of Tag 0 and resp.
   - For this I have plotted a scatter plot along with the marginal distribution of each feature in Tag 0 and fitted a linear model.
   - In these charts the x axis shows the different features that belong to the Tag 0 category and the y axis shows resp.
   

In [None]:
from scipy import stats
def r2(x, y):
    return stats.pearsonr(x, y)[0] ** 2
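The `r2` helper relies on the fact that, for a straight-line fit with an intercept, the squared Pearson correlation equals the R2 of the fit. A minimal check of that identity on synthetic data (the arrays `x` and `y` are made up) looks like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.3 * x + rng.normal(size=500)

# Squared Pearson correlation, as computed by the r2 helper
pearson_r2 = stats.pearsonr(x, y)[0] ** 2

# R2 of an ordinary least-squares line: 1 - residual variance / total variance
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
fit_r2 = 1.0 - residuals.var() / y.var()

print(round(pearson_r2, 6), round(fit_r2, 6))
```

Since the score is a squared correlation, it carries no sign; the direction of the relationship has to be read from the slope or from the correlation itself.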

In [None]:
import seaborn as sns
class myjoint(sns.JointGrid):
    def __init__(self, x, y, data=None,height=7, ratio=5, space=.2,
                 dropna=True, xlim=None, ylim=None, size=None):
        super(myjoint, self).__init__(x, y, data,height, ratio, space,
                 dropna, xlim, ylim, size)
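        # Close figure number 2, expected to be the standalone figure that
        # JointGrid just created, so that only the outer figure `f` remains
        plt.close(2)
        # Set up the subplot grid inside the outer figure `f`, using the global
        # GridSpec slot `gs` that the plotting loop assigns before each call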
        self.ax_joint = f.add_subplot(gs[1:, :-1])
        self.ax_marg_x = f.add_subplot(gs[0, :-1], sharex=self.ax_joint)
        self.ax_marg_y = f.add_subplot(gs[1:, -1], sharey=self.ax_joint)

        # Turn off tick visibility for the measure axis on the marginal plots
        plt.setp(self.ax_marg_x.get_xticklabels(), visible=False)
        plt.setp(self.ax_marg_y.get_yticklabels(), visible=False)

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 3, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_0_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
    r2_score = r2(x=sample_df[column].values,y=sample_df["resp"].values)
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* From the charts above we can see that most of the data is normally distributed, and features 10, 19, 20, 29, 30 have a somewhat similar distribution with respect to resp on the y axis. Feature 56 has a different distribution from every other feature in this category, which explains why its correlation with the other features is very low compared to the correlations among the other features.
* A positive relation between features 29, 30, 73, 79 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between features 103, 29, 19 and resp (the target variable) can be seen in the charts above.
* Feature 29 has the highest **R2 score** in this tag at 0.0484, and feature 19 has the second highest at 0.0368.
* Feature 115 has the lowest **R2 score** in this tag/group of features, at 0.0.
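Instead of reading the sign and the R2 score of each feature off the individual charts, the same information can be collected in one sorted table. This is a sketch on synthetic data (the frame `toy` and its columns are made up, with one positively related, one negatively related and one unrelated feature); with the notebook's data, the feature columns of a tag and `sample_df["resp"]` would be used instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1000
resp = rng.normal(size=n)

# Synthetic features with a positive, a negative and no relation to resp
toy = pd.DataFrame({
    "feature_a": 0.5 * resp + rng.normal(size=n),
    "feature_b": -0.3 * resp + rng.normal(size=n),
    "feature_c": rng.normal(size=n),
    "resp": resp,
})

# Signed correlation with resp and its square, highest R2 first
corr = toy.drop(columns="resp").corrwith(toy["resp"])
table = pd.DataFrame({"corr": corr, "r2": corr ** 2}).sort_values(
    "r2", ascending=False
)
print(table.round(4))
```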

# Group  Tag 1
*    The features in tag 1 belong to the same group because the same concepts were used to create them. Some of the features use more than one concept/tag.

In [None]:
tag_1_df = sample_df[[*categories['tag_1'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=3,figsize=(25,50))
for i, column in enumerate(tag_1_df.columns):
    sns.distplot(tag_1_df[column],ax=axes[i//3,i%3],color='k')

# Correlation Analysis of Group 1

In [None]:
corr = tag_1_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(45, 20))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Linear Regression Analysis of Group 1  
   - Here I have tried to find the relationship between the different features of Tag 1 and resp.
   - For this I have plotted a scatter plot along with the marginal distribution of each feature in Tag 1 and fitted a linear model.
   - In these charts the x axis shows the different features that belong to the Tag 1 category and the y axis shows resp.

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 3, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_1_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
#     plt.xlabel(f"{column}")
# f.tight_layout()


   - Features 15 and 16 have a similar distribution. The same goes for 25 and 26, and the pairs (35, 36), (88, 94), (100, 106) each share a distribution too. Feature 59 has a different distribution from every other feature in this category.
    
   - **The joint distribution of feature 59 and resp is somewhat similar to that of weight and resp above, which explains why the correlation of feature 59 with the other features of this category is very low compared to the correlations among the other features.**
   
* A positive relation between features 35, 36, 88, 76, 82, 94 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between features 100, 106, 26, 25 and resp (the target variable) can be seen in the charts above.
* Feature 35 has the highest **R2 score** in this tag at 0.0524, and feature 25 has the second highest at 0.0483.
* Feature 118 has the lowest **R2 score** in this tag/group of features, at 0.0001.
 
   

#  Group Tag 2
   - The features in tag 2 belong to the same group because the same concepts were used to create them. Some of the features use more than one concept/tag.

In [None]:
tag_2_df = sample_df[[*categories['tag_2'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=3,figsize=(25,50))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_2_df.columns):
    sns.distplot(tag_2_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 2

In [None]:
corr = tag_2_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(45, 20))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   - Among the first 6 features, the highest correlations occur in groups of two; for example, features 13 and 14 have the highest correlation with each other. The same pattern continues until feature 58 (which behaves like features 56, 57, 59 in the other categories), and then the pattern starts again.
   - These correlation values again support my hypothesis, and the competition host's statement mentioned above, that the features in this category are made from the same concept.

# Linear Regression Analysis of Group 2

   - Here I have tried to find the relationship between the different features of Tag 2 and resp.
   - For this I have plotted a scatter plot along with the marginal distribution of each feature in Tag 2 and fitted a linear model.
   - In these charts the x axis shows the different features that belong to the Tag 2 category and the y axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 3, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_2_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 33, 34, 87, 11 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between features 23, 24 and resp (the target variable) can be seen in the charts above.
* Feature 23 has the highest **R2 score** in this tag at 0.0542, and feature 33 has the second highest at 0.054.
* Feature 117 has the lowest **R2 score** in this tag/group of features, at 0.0001.
 

#  Group Tag 3
* The features in tag 3 belong to the same group because the same concepts were used to create them. Some of the features use more than one concept/tag.

In [None]:
tag_3_df = sample_df[[*categories['tag_3'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=3,figsize=(25,50))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_3_df.columns):
    sns.distplot(tag_3_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 3

In [None]:
corr = tag_3_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(45, 20))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   - The same correlation pattern in groups of two features that we saw in the groups above appears here. Just like the odd features before, feature 57 has little or no correlation with the other features in this group, and after feature 57 the correlation pattern starts again.

# Linear Regression Analysis of Group 3
   - Here I have tried to find the relationship between the different features of Tag 3 and resp.
   - For this I have plotted a scatter plot along with the marginal distribution of each feature in Tag 3 and fitted a linear model.
   - In these charts the x axis shows the different features that belong to the Tag 3 category and the y axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 3, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_3_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


   - The first 6 features have a somewhat similar distribution, then feature 57 has a different distribution, just like features 56 and 59 in the earlier categories, and after that the same behaviour I mentioned for the category above continues.
* A positive relation between features 11, 12, 31, 32 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between features 21, 22 and resp (the target variable) can be seen in the charts above.
* Feature 31 has the highest **R2 score** in this tag at 0.0599, and feature 21 has the second highest at 0.0481.
* Features 92 and 98 have the lowest **R2 score** in this tag/group of features, at 0.0.
 

# Group tag 4
* This group includes the Tag 4 features, which are created with a 5th similar concept. Note that some of these features are created with multiple concepts.

In [None]:
tag_4_df = sample_df[[*categories['tag_4'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=3,figsize=(25,50))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_4_df.columns):
    sns.distplot(tag_4_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 4

In [None]:
corr = tag_4_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(45, 20))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   - The same correlation pattern as described above. The first 6 features are correlated with each other but not with the rest, then feature 55 has no correlation with any feature, and the remaining 10 features are correlated with each other but not with any other feature.

#  Linear Regression Analysis of Group 4
   - Here I have tried to find the relationship between the different features of Tag 4 and resp.
   - For this I have plotted a scatter plot along with the marginal distribution of each feature in Tag 4 and fitted a linear model.
   - In these charts the x axis shows the different features that belong to the Tag 4 category and the y axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 3, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_4_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* Features 17, 18, 27, 28 have a different distribution in the charts above than any other feature in this group. Most of the features in this group are normally distributed, but the shape of these four charts is a lot different. These features also have a higher R2 score than the other features in this group.
* A positive relation between features 27, 28, 7, 8 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between features 17, 18 and resp (the target variable) can be seen in the charts above.
* Feature 27 has the highest **R2 score** in this tag at 0.0638, which is the highest I have seen so far. Feature 17 has the second highest **R2 score** at 0.0542.
* Feature 120 has the lowest **R2 score** in this tag/group of features, at 0.0001.
 


# Group Tag 5
* This group includes the Tag 5 features, which are created with a 6th similar concept. Note that some of these features are created with multiple concepts.

In [None]:
tag_5_df = sample_df[[*categories['tag_5'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=4
                         , ncols=2,figsize=(25,50))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_5_df.columns):
    sns.distplot(tag_5_df[column],ax=axes[i//2,i%2],color='y')

* From the chart above we can see that only 8 features belong to this category.
* The groups before this one have more than 16 features, but this one has only 8.
* Again, all the features here are zero reverted.

# Correlation Analysis of Group 5

In [None]:
corr = tag_5_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 10))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* The same correlation pattern can be seen here as well.
* The first two features are strongly correlated with each other and have no significant correlation with any other feature.
* The last 6 features are strongly correlated with each other and have no significant correlation with the first two features.

# Linear Regression Analysis of Group 5
   - Here I have tried to find the relationship between the different features of Tag 5 and resp.
   - For this I have plotted a scatter plot along with the marginal distribution of each feature in Tag 5 and fitted a linear model.
   - In these charts the x axis shows the different features that belong to the Tag 5 category and the y axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(4, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_5_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* Despite their strong correlation with each other, the first two features have quite different distributions in the charts above.
* Features 89, 113 and 101 have a somewhat similar distribution in the charts above, and this behaviour is reflected in their correlation as well.
* Exactly the same behaviour holds for features 95, 119 and 107, both in the correlation chart and in the charts above.

* A positive relation between features 89, 101, 113 and resp (the target variable) can be seen in the **linear regression analysis** above.
* No negative relation between the features and resp (the target variable) is seen in the charts above.
* Feature 101 has the highest **R2 score** in this tag at 0.009, and feature 113 has the second highest at 0.0074.
* Feature 83 has the lowest **R2 score** in this tag/group of features, at 0.0005.
 


# Group Tag 6
* This group includes the Tag 6 features, which are created with a 7th similar concept. Note that some of these features are created with multiple concepts.

In [None]:
tag_6_df = sample_df[[*categories['tag_6'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=10
                         , ncols=4,figsize=(25,50))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_6_df.columns):
    sns.distplot(tag_6_df[column],ax=axes[i//4,i%4],color='m')

* This group/tag has the highest number of features I have seen so far.
* Previous categories had 17 features, and one had 8 features.

Seeing the highest number of features in one category leads me to another hypothesis.



# Hypothesis
> A category having the highest number of features can mean that the concepts used to create these features are a lot more useful for explaining this commodity's behaviour.


# Correlation Analysis of Group 6

In [None]:
corr = tag_6_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(55, 35))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* The same positive correlation pattern in groups of two features can be seen here as well.
* Then we can see positive correlation among features 17 to 26, but these features are negatively correlated with the features next to them (27 to 36).
* The same behaviour of positive and negative correlation can be seen for the last four features as well.

# Linear Regression Analysis of Group 6
   - Here I have tried to find the relationship between the different features of Tag 6 and resp.
   - For this I have plotted a scatter plot along with the marginal distribution of each feature in Tag 6 and fitted a linear model.
   - In these charts the x axis shows the different features that belong to the Tag 6 category and the y axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(10, 4, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_6_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* The pattern of strong correlation within pairs of features can be seen here.
* The first two features are strongly correlated with each other. The next 4 features are strongly positively correlated with each other, but some of them correlate negatively and others positively with the last 25 features.
* The feature correlations support my hypothesis that features created from a similar concept are strongly correlated.
* **This group has one of the highest numbers of features as well as the highest R2 scores.**
* A positive relation between features 39, 40, 33, 34, 35, 36, 29, 30, 31, 32, 27, 28 and resp (target variable) can be seen in the **Linear Regression analysis** above.
* A negative relation between features 37, 38, 26, 25, 17, 18, 19, 20, 21, 22, 23, 24, 3, 4, 5, 6, 7, 12, 11 and resp (target variable) can be seen in the charts above.
* Feature 39 has the highest **R2 score** in this tag (0.1344), and feature 40 has the second highest **R2 score** (0.1099).
* Feature 13 has the lowest **R2 score** in this tag/group (0.0004).


   > The high R2 scores in this group support the **hypothesis** that I mentioned above.
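
The rankings in the bullets above are read from the chart labels. A minimal sketch of how the same summary could be tabulated is shown here, assuming that the squared Pearson correlation is an acceptable stand-in for the R2 of a one-feature linear fit (the two are equal when the fit includes an intercept). The function `rank_features_by_r2` and the toy columns `feature_a`, `feature_b` and `feature_c` are hypothetical names, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic data: three made-up features with known relations to resp.
rng = np.random.default_rng(1)
n = 2000
resp = rng.normal(size=n)
toy = pd.DataFrame({
    "feature_a": 0.8 * resp + rng.normal(size=n),   # positive relation
    "feature_b": -0.3 * resp + rng.normal(size=n),  # negative relation
    "feature_c": rng.normal(size=n),                # no relation
    "resp": resp,
})

def rank_features_by_r2(df, target="resp"):
    """Rank features by squared correlation with the target and report the sign."""
    r = df.drop(columns=[target]).corrwith(df[target])
    summary = pd.DataFrame({
        "r2": r ** 2,
        "direction": np.where(r > 0, "positive", "negative"),
    })
    return summary.sort_values("r2", ascending=False)

summary = rank_features_by_r2(toy)
print(summary)
```

With the notebook's own data, a call such as `rank_features_by_r2(sample_df[list(tag_6_df.columns) + ["resp"]])` would produce the per-feature ranking for this tag in one table.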
 

# Group Tag 7

* This group contains the Tag 7 features, which are created from the 8th similar concept. These features are also present in Tag 6, so they are created from a combination of similar concepts.


In [None]:
tag_7_df = sample_df[[*categories['tag_7'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=1
                         , ncols=2,figsize=(10,8))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_7_df.columns):
    sns.distplot(tag_7_df[column],ax=axes[i%2],color='m')

# Correlation Analysis of Group 7

In [None]:
corr = tag_7_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(7, 7))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* The chart above shows a strong positive correlation between these two features.

# Linear Regression Analysis of Group 7
   - Here I look for a relationship between each Tag 7 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 7 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 7 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(20,15))
outer_grid = gridspec.GridSpec(1, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_7_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A strong correlation can be seen between these two features.
* These two features are also present in Tag 6. This means that two concepts, for example (as I mentioned above) volume imbalance and price tick level, can be used together to create a combination of features.
* Both features have a roughly neutral relationship with resp, as seen in the charts above.
* Feature 1 has the highest **R2 score** in this group (0.0022).
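
To check which features two tags have in common, the tag metadata can be intersected directly. The sketch below assumes a metadata table with one row per feature and one boolean column per tag; that layout, the helper name `shared_features` and the toy table are assumptions for illustration, not the real contents of features.csv.

```python
import pandas as pd

# Toy stand-in for the tag metadata: True means the feature carries that tag.
features_meta = pd.DataFrame(
    {
        "tag_6": [True, True, True, False],
        "tag_7": [True, True, False, False],
    },
    index=["feature_1", "feature_2", "feature_3", "feature_4"],
)

def shared_features(meta, tag_a, tag_b):
    """List the features that are flagged in both tag columns."""
    in_both = (meta[tag_a] & meta[tag_b]).to_numpy()
    return meta.index[in_both].tolist()

print(shared_features(features_meta, "tag_6", "tag_7"))
# ['feature_1', 'feature_2']
```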

# Group Tag 8
* This group contains the Tag 8 features, which are created from the 9th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_8_df = sample_df[[*categories['tag_8'][0]]]

In [None]:
fig, axes = plt.subplots(nrows =1
                         , ncols=2,figsize=(10,10))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_8_df.columns):
    sns.distplot(tag_8_df[column],ax=axes[i%2],color='m')  # plot the Tag 8 columns, not Tag 6

# Correlation Analysis of Group 8

In [None]:
corr = tag_8_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(7, 7))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* The strong positive correlation in this group again supports my hypothesis.

# Linear Regression Analysis of Group 8
   - Here I look for a relationship between each Tag 8 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 8 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 8 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,15))
outer_grid = gridspec.GridSpec(1, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_8_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A strong positive correlation exists between these two features.
* There is a relationship between resp and these two features; the linear regression model fitted in the charts above shows it.
* A positive relation between features 3, 4 and resp (target variable) can be seen in the **Linear Regression analysis** above.
* Feature 3 has the highest **R2 score** in this tag (0.076).
* Feature 4 has the lowest **R2 score** in this tag/group (0.0681).
 

# Group Tag 9
* This group contains the Tag 9 features, which are created from the 10th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_9_df = sample_df[[*categories['tag_9'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=7
                         , ncols=3,figsize=(25,50))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_9_df.columns):
    sns.distplot(tag_9_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 9

In [None]:
corr = tag_9_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(45, 25))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Linear Regression Analysis of Group 9
   - Here I look for a relationship between each Tag 9 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 9 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 9 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(7, 3, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_9_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* The same relation between resp and the features that we saw in the previous tag can be seen here for features 6 and 40.
* Feature 2, which is also present in Tag 7, has a direct relation (positive correlation) with feature 71, which is present in this category.
* A positive relation between features 40, 36, 34, 32, 4, 6, 30, 28 and resp (target variable) can be seen in the **Linear Regression analysis** above.
* A negative relation between features 20, 22, 24, 26, 38 and resp (target variable) can be seen in the charts above.
* Feature 40 has the highest **R2 score** in this tag (0.1099), and feature 4 has the second highest **R2 score** (0.0681).
* Feature 14 has the lowest **R2 score** in this tag/group (0.0001).
 

# Group Tag 10
* This group contains the Tag 10 features, which are created from the 11th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_10_df = sample_df[[*categories['tag_10'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=1
                         , ncols=2,figsize=(17,7))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_10_df.columns):
    sns.distplot(tag_10_df[column],ax=axes[i%2],color='m')

* These features are also present in the earlier Group/Tag 6.
* Judging by the distribution of the data, these features appear to be zero-reverting.

# Correlation Analysis of Group 10 

In [None]:
corr = tag_10_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(8, 8))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* A strong positive correlation exists between these features as well.

# Linear Regression Analysis of Group 10
   - Here I look for a relationship between each Tag 10 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 10 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 10 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,20))
outer_grid = gridspec.GridSpec(1, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_10_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* The linear regression fitted above suggests that a linear model can explain part of the relationship between resp (used as the target variable by most notebooks I have seen on Kaggle) and the features in this group.
* A positive relation between both features and resp (target variable) can be seen in the **Linear Regression analysis** above.
* Feature 5 has the highest **R2 score** in this tag (0.0759).
* Feature 6 has the lowest **R2 score** in this tag/group (0.0653).
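
To put these scores in perspective, the relation below assumes that the `r2` helper (defined outside this section) reports the coefficient of determination of a one-feature least-squares fit with an intercept. Under that assumption, R2 equals the squared Pearson correlation between the feature and resp:

```latex
R^2 \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \;=\; r_{xy}^2,
\qquad
|r_{xy}| = \sqrt{0.0759} \approx 0.28
```

Read this way, the best feature in this group accounts for about 7.6% of the variance of resp, which corresponds to a correlation of roughly 0.28 in absolute value.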
 

# Group Tag 11
* This group contains the Tag 11 features, which are created from the 12th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_11_df = sample_df[[*categories['tag_11'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=5
                         , ncols=2,figsize=(30,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_11_df.columns):
    sns.distplot(tag_11_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 11

In [None]:
corr = tag_11_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 15))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* All the features in this group are positively correlated with each other.
* The pattern of strong correlation within pairs of features can be seen here as well.
* No negative correlation among features is present in this group.
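
Both observations can also be checked numerically rather than by eye. The sketch below lists the feature pairs whose absolute correlation passes a threshold and tests whether any entry of the correlation matrix is negative. The helper names, the 0.8 threshold and the synthetic columns `f1` to `f4` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold]

def has_negative_correlation(df):
    """True if any pair of columns is negatively correlated."""
    return bool((df.corr().to_numpy() < 0).any())

# Synthetic data: two tightly coupled pairs that are also mildly related to each other.
rng = np.random.default_rng(3)
base = rng.normal(size=5000)
other = rng.normal(size=5000)
toy = pd.DataFrame({
    "f1": base,
    "f2": base + 0.1 * rng.normal(size=5000),
    "f3": 0.5 * base + other,
    "f4": 0.5 * base + other + 0.1 * rng.normal(size=5000),
})

print(strong_pairs(toy))
print(has_negative_correlation(toy))
```

With the notebook's data, `strong_pairs(tag_11_df)` would list the pairs that stand out in the heatmap.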

# Linear Regression Analysis of Group 11
   - Here I look for a relationship between each Tag 11 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 11 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 11 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,50))
outer_grid = gridspec.GridSpec(5, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_11_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 7, 8, 11, 12 and resp (target variable) can be seen in the **Linear Regression analysis** above.
* No negative relation between the features and resp (target variable) is present in this group.
* Feature 7 has the highest **R2 score** in this tag (0.015), and feature 8 has the second highest **R2 score** (0.0112).
* Feature 14 has the lowest **R2 score** in this tag/group (0.0001).
 

# Group Tag 12
* This group contains the Tag 12 features, which are created from the 13th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_12_df = sample_df[[*categories['tag_12'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=8
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_12_df.columns):
    sns.distplot(tag_12_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 12

In [None]:
corr = tag_12_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(25, 20))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* The first 12 features are strongly positively correlated with each other.
* Strong positive correlation within pairs of features can be seen here as well.
* The last 4 features are also negatively correlated with one another.
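
Block patterns like these can be summarised with the average correlation inside a block of features and between two blocks. The following is a minimal sketch on synthetic data; `mean_block_corr` and the column names `a0` to `a2`, `b0` and `b1` are hypothetical and only mimic a positively correlated block next to a negatively correlated one.

```python
import numpy as np
import pandas as pd

def mean_block_corr(corr, rows, cols):
    """Average correlation between two feature blocks, ignoring self-correlations."""
    block = corr.loc[rows, cols].to_numpy(dtype=float)
    if list(rows) == list(cols):
        # Within a single block, drop the diagonal of ones.
        block = block[~np.eye(len(rows), dtype=bool)]
    return float(block.mean())

rng = np.random.default_rng(4)
n = 4000
factor_a = rng.normal(size=n)
factor_b = rng.normal(size=n)
toy = pd.DataFrame({f"a{i}": factor_a + 0.5 * rng.normal(size=n) for i in range(3)})
toy["b0"] = factor_b + 0.5 * rng.normal(size=n)
toy["b1"] = -factor_b + 0.5 * rng.normal(size=n)

corr = toy.corr()
a_cols = ["a0", "a1", "a2"]
b_cols = ["b0", "b1"]
within_a = mean_block_corr(corr, a_cols, a_cols)
within_b = mean_block_corr(corr, b_cols, b_cols)
between = mean_block_corr(corr, a_cols, b_cols)
print(round(within_a, 2), round(within_b, 2), round(between, 2))
```

Applied to `tag_12_df.corr()`, the same helper could compare the first 12 columns with the last 4 by passing the corresponding column lists.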

# Linear Regression Analysis of Group 12
   - Here I look for a relationship between each Tag 12 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 12 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 12 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(8, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_12_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* Features 60 and 61 have a different data distribution from any other feature I have seen.
* No positive relation between the features and resp (target variable) is present in this group.
* A negative relation between the first 12 features and resp (target variable) can be seen in the charts above.
* Feature 37 has the highest **R2 score** in this tag (0.0568), and features 17 and 23 share the second highest **R2 score** (0.0542).
* Feature 66 has the lowest **R2 score** in this tag/group (0.0001).
 

# Group Tag 13
* This group contains the Tag 13 features, which are created from the 14th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_13_df = sample_df[[*categories['tag_13'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=8
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_13_df.columns):
    sns.distplot(tag_13_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 13

In [None]:
corr = tag_13_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(25, 20))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Linear Regression Analysis of Group 13
   - Here I look for a relationship between each Tag 13 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 13 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 13 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(8, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_13_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between almost all features and resp (target variable) can be seen in the **Linear Regression analysis** above.
* No negative relation between the features and resp (target variable) is found in this group.
* Feature 39 has the highest **R2 score** in this tag (0.1344), and feature 40 has the second highest **R2 score** (0.1099).
* Features 62 and 63 have the lowest **R2 score** in this tag/group (0.0).
* **This Tag looks like a reflection of the previous group.**
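
One way to make "looks like a reflection of the previous group" measurable is the overlap between the two tags' feature lists. The sketch below uses the Jaccard index (shared features divided by all distinct features). The function name and the two feature lists are made up for illustration and are not the real tag memberships.

```python
def jaccard_overlap(features_a, features_b):
    """Share of distinct features that appear in both lists (1.0 means identical sets)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical feature lists for two tags.
tag_x = ["feature_1", "feature_2", "feature_3", "feature_4"]
tag_y = ["feature_3", "feature_4", "feature_5", "feature_6"]

overlap = jaccard_overlap(tag_x, tag_y)
print(round(overlap, 3))
```

With the notebook's variables this could be called as `jaccard_overlap(tag_12_df.columns, tag_13_df.columns)`; a value close to 1 would mean that the two tags cover nearly the same features.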
 

# Group Tag 14
* This group contains the Tag 14 features, which are created from the 15th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_14_df = sample_df[[*categories['tag_14'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=1
                         , ncols=3,figsize=(30,7))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_14_df.columns):
    sns.distplot(tag_14_df[column],ax=axes[i%3],color='m')

# Correlation Analysis of Group 14

In [None]:
corr = tag_14_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(7, 7))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* Feature 41 is negatively correlated with features 42 and 43.
* Features 42 and 43 are positively correlated with each other.

# Linear Regression Analysis of Group 14
   - Here I look for a relationship between each Tag 14 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 14 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 14 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,30))
outer_grid = gridspec.GridSpec(3, 1, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_14_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* Feature 43 has the highest **R2 score** in this tag (0.0039), and feature 41 has the second highest **R2 score** (0.002).
* Feature 42 has the lowest **R2 score** in this tag/group (0.00006).
* The distribution of these features is quite different from that of any other feature I have seen.
 

# Group Tag 15
* This group contains the Tag 15 features, which are created from the 16th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_15_df = sample_df[[*categories['tag_15'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=13
                         , ncols=2,figsize=(25,55))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_15_df.columns):
    sns.distplot(tag_15_df[column],ax=axes[i//2,i%2],color='m')

* More features are skewed here than in any of the previous groups.
* Only features 72, 73 and 75 have skewness similar to that of a normal distribution.
* The large skewness values may be due to the smaller sample size that I am using here.
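
Skewness can be quantified with `DataFrame.skew()`, and comparing a small random sample against the full data gives a feel for how much of it is sampling noise. The sketch below uses synthetic columns; the names `symmetric` and `right_skewed` and the sample size of 500 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=50_000),        # skewness close to 0
    "right_skewed": rng.lognormal(size=50_000),  # long right tail
})

# Skewness on all rows versus on a small random sample of rows.
full_skew = toy.skew()
small_skew = toy.sample(n=500, random_state=0).skew()

comparison = pd.DataFrame({"full": full_skew, "sample_500": small_skew})
print(comparison)
```

On the notebook's data, comparing `tag_15_df.skew()` with the same statistic on the full training set (if it fits in memory) would show whether the skew is a property of the features or of the sample.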

# Correlation Analysis of Group 15

In [None]:
corr = tag_15_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(25, 30))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Linear Regression Analysis of Group 15
   - Here I look for a relationship between each Tag 15 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 15 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 15 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(13, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_15_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 73, 74, 75, 76, 85, 87, 88 and resp (target variable) can be seen in the **Linear Regression analysis** above.
* A negative relation between features 77, 97 and resp (target variable) is found in this group.
* Feature 85 has the highest **R2 score** in this tag (0.016), and feature 73 has the second highest **R2 score** (0.013).
* Feature 98 has the lowest **R2 score** in this tag/group (0.0).
 

# Group Tag 16
* This group contains the Tag 16 features, which are created from the 17th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_16_df = sample_df[[*categories['tag_16'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=4
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_16_df.columns):
    sns.distplot(tag_16_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 16

In [None]:
corr = tag_16_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 10))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Linear Regression Analysis of Group 16
   - Here I look for a relationship between each Tag 16 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 16 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 16 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,40))
outer_grid = gridspec.GridSpec(4, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_16_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* No positive or negative relation between the features in this group and resp (target variable) is seen in the **Linear Regression analysis** above.
* Features 70 and 121 have the highest **R2 score** in this tag (0.0008).
* Feature 54 has the lowest **R2 score** in this tag/group (0.0).
* **This Tag looks like a reflection of the previous group.**
 

# Group Tag 17
* This group contains the Tag 17 features, which are created from the 18th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_17_df = sample_df[[*categories['tag_17'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=9
                         , ncols=3,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_17_df.columns):
    sns.distplot(tag_17_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 17

In [None]:
corr = tag_17_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(35, 25))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Linear Regression Analysis of Group 17
   - Here I look for a relationship between each Tag 17 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 17 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 17 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(9, 3, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_17_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 79, 81, 82, 91 and resp (target variable) can be seen in the **Linear Regression analysis** above.
* A negative relation between features 46, 50, 103, 106 and resp (target variable) is found in this group.
* Feature 79 has the highest **R2 score** in this tag (0.018), and feature 99 has the second highest **R2 score** (0.015).
* Feature 115 has the lowest **R2 score** in this tag/group (0.0).
* **Features in this Tag have low R2 scores compared with the other groups that I have explored (0.018 at most).**
 

# Group Tag 18
* This group contains the Tag 18 features, which are created from the 19th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_18_df = sample_df[[*categories['tag_18'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=1
                         , ncols=2,figsize=(10,10))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_18_df.columns):
    sns.distplot(tag_18_df[column],ax=axes[i%2],color='m')

# Correlation Analysis of Group 18

In [None]:
corr = tag_18_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* Both features in this tag are positively correlated with each other, which again supports my hypothesis.

# Linear Regression Analysis of Group 18
   - Here I look for a relationship between each Tag 18 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 18 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 18 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(20,15))
outer_grid = gridspec.GridSpec(1, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_18_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation, to some extent, between feature 44 and resp (target variable) can be seen in the **Linear Regression analysis** above.
* No negative relation between the features and resp (target variable) is found in this group.
* Feature 44 has the highest **R2 score** in this tag (0.018).
* Feature 45 has the lowest **R2 score** in this tag/group (0.0001).
 

# Group Tag 19
* This group contains the Tag 19 features, which are created from the 20th similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_19_df = sample_df[[*categories['tag_19'][0]]]


In [None]:
fig, axes = plt.subplots(nrows=4
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_19_df.columns):
    sns.distplot(tag_19_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 19

In [None]:
corr = tag_19_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(25, 15))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* Strong positive correlation is found in this group as well.

# Linear Regression Analysis of Group 19
   - Here I look for a relationship between each Tag 19 feature and resp.
   - For this, I plot a scatter plot with the marginal distribution of each Tag 19 feature and fit a linear model.
   - In each chart, the x-axis shows one of the Tag 19 features and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(4, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_19_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* No positive relation between the features and resp (target variable) can be seen in the **Linear Regression analysis** above.
* A negative relation between features 46, 50 and resp (target variable) is found in this group.
* Feature 51 has the highest **R2 score** in this tag (0.0011), and feature 47 has the second highest **R2 score** (0.0007).
* Feature 49 has the lowest **R2 score** in this tag/group (0.0001).
* **Features in this Tag and in Tag 17 have some of the lowest R2 scores of the groups that I have explored.**
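
The highest R2 score reported for each tag so far can be lined up to compare the groups directly. The values below are copied from the bullets in this notebook (Tag 6 to Tag 19); the dictionary and variable names are only for illustration.

```python
# Highest per-feature R2 score reported for each tag in the analysis above.
best_r2_by_tag = {
    "tag_6": 0.1344, "tag_7": 0.0022, "tag_8": 0.076, "tag_9": 0.1099,
    "tag_10": 0.0759, "tag_11": 0.015, "tag_12": 0.0568, "tag_13": 0.1344,
    "tag_14": 0.0039, "tag_15": 0.016, "tag_16": 0.0008, "tag_17": 0.018,
    "tag_18": 0.018, "tag_19": 0.0011,
}

# Sort from the weakest to the strongest best feature.
ranked = sorted(best_r2_by_tag.items(), key=lambda item: item[1])
for tag, score in ranked:
    print(f"{tag:>7}: {score:.4f}")
```

On these reported maxima, Tag 16 (0.0008), Tag 19 (0.0011) and Tag 7 (0.0022) have the weakest best feature, while Tag 6 and Tag 13 share the strongest (0.1344). Averages over all features in a tag could rank the groups differently.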
 

# Group Tag 20
* This group contains the Tag 20 features, which are created from the 21st similar concept. Note that some of the features are created from multiple concepts.

In [None]:
tag_20_df = sample_df[[*categories['tag_20'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=2
                         , ncols=3,figsize=(15,15))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_20_df.columns):
    sns.distplot(tag_20_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 20

In [None]:
corr = tag_20_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 10))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 20
   - Here I look for a linear relationship between each Tag 20 feature and resp.
   - For each feature in Tag 20, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 20 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,40))
outer_grid = gridspec.GridSpec(3, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_20_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* No positive or negative relation between the features and resp (the target variable) can be seen in the **linear regression analysis** above.
* Feature 69 has the highest **R2 score** in this tag (0.0033), and feature 71 has the second highest (0.0023).
* Feature 54 has the lowest **R2 score** in this tag (0.0).
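
An R2 score shown as 0.0 most likely reflects the `round(r2_score, 4)` call in the plotting cell rather than an exact zero: anything below 0.00005 rounds to 0.0. The small sketch below shows the effect with made-up values (the numbers and feature names are assumptions, not results from this dataset) and prints the same values in scientific notation, which keeps small magnitudes visible.

```python
# Made-up R2 values for illustration only (not taken from the dataset).
r2_values = {"feature_x": 0.00004, "feature_y": 0.0033}

for name, value in r2_values.items():
    rounded = round(value, 4)      # what the axis labels above display
    scientific = f"{value:.2e}"    # keeps very small scores distinguishable
    print(f"{name}: rounded={rounded}  scientific={scientific}")
```

Using a format such as `f"{r2_score:.2e}"` in the axis label would make it possible to rank features whose scores all round to 0.0.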
 

# Group Tag 21
* This group contains the Tag 21 features, which are built from the 22nd shared concept. Note that some features are built from more than one concept.

In [None]:
tag_21_df = sample_df[[*categories['tag_21'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=2
                         , ncols=3,figsize=(25,15))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_21_df.columns):
    sns.distplot(tag_21_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 21

In [None]:
corr = tag_21_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 15))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 21
   - Here I look for a linear relationship between each Tag 21 feature and resp.
   - For each feature in Tag 21, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 21 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(3, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_21_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A neutral relation (neither clearly positive nor negative) between the features and resp (the target variable) can be seen in the **linear regression analysis** above.
* Features 59 and 57 have the highest **R2 score** in this tag (0.0007), and features 55 and 58 have the second highest (0.015).
* Feature 56 has the lowest **R2 score** in this tag (0.0).


# Group Tag 22
* This group contains the Tag 22 features, which are built from the 23rd shared concept. Note that some features are built from more than one concept.

In [None]:
tag_22_df = sample_df[[*categories['tag_22'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=3
                         , ncols=3,figsize=(15,25))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_22_df.columns):
    sns.distplot(tag_22_df[column],ax=axes[i//3,i%3],color='m')

# Correlation Analysis of Group 22

In [None]:
corr = tag_22_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 15))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 22
   - Here I look for a linear relationship between each Tag 22 feature and resp.
   - For each feature in Tag 22, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 22 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(5, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_22_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A neutral relation (neither clearly positive nor negative) between the features and resp (the target variable) can be seen in the **linear regression analysis** above.
* Feature 68 has the highest **R2 score** in this tag (0.009), and features 67 and 64 have the second highest (0.0008).
* Features 62 and 63 have the lowest **R2 score** in this tag (0.0).
 

# Group Tag 23
* This group contains the Tag 23 features, which are built from the 24th shared concept. Note that some features are built from more than one concept.

In [None]:
tag_23_df = sample_df[[*categories['tag_23'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=24
                         , ncols=2,figsize=(25,95))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_23_df.columns):
    sns.distplot(tag_23_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 23

In [None]:
corr = tag_23_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(55, 45))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 23
   - Here I look for a linear relationship between each Tag 23 feature and resp.
   - For each feature in Tag 23, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 23 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(10, 5, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_23_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 73, 74, 75, 76, 79, 81, 82, 85, 87, 88, 91, 94, 111 and 101 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between features 77, 103 and 106 and resp is found in this group.
* Feature 79 has the highest **R2 score** in this tag (0.018), and feature 73 has the second highest (0.016).
* Feature 92 has the lowest **R2 score** in this tag (0.0).
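
For a group this large, the annotated heatmap is hard to read cell by cell. As a complement, the strongest correlations can be listed as a ranked table. The sketch below is one possible way to do this; `top_correlated_pairs` and the synthetic `demo` frame are hypothetical names introduced for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, k=10):
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    cols = corr.columns
    # Indices of the strict upper triangle, so each pair is listed once
    # and the diagonal (self-correlation) is skipped.
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "feature_1": cols[i],
        "feature_2": cols[j],
        "corr": corr.to_numpy()[i, j],
    })
    pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return pairs.head(k).reset_index(drop=True)


# Synthetic stand-in for a tag group (assumption: NOT the competition data).
rng = np.random.default_rng(1)
base = rng.normal(size=2000)
demo = pd.DataFrame({
    "f1": base,
    "f2": base + 0.1 * rng.normal(size=2000),   # nearly a copy of f1
    "f3": -base + 0.5 * rng.normal(size=2000),  # negatively related to f1
    "f4": rng.normal(size=2000),                # independent of the others
})

top_pairs = top_correlated_pairs(demo, k=3)
print(top_pairs)
```

On the notebook's data the call would presumably be `top_correlated_pairs(tag_23_df, k=20)`, which lists the pairs that stand out in the heatmap without having to scan every cell.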
 

# Group Tag 24
* This group contains the Tag 24 features, which are built from the 25th shared concept. Note that some features are built from more than one concept.

In [None]:
tag_24_df = sample_df[[*categories['tag_24'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_24_df.columns):
    sns.distplot(tag_24_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 24

In [None]:
corr = tag_24_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(25, 35))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 24
   - Here I look for a linear relationship between each Tag 24 feature and resp.
   - For each feature in Tag 24, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 24 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_24_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 96 and 101 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between features 100, 103, 105 and 106 and resp is found in this group.
* Feature 103 has the highest **R2 score** in this tag (0.0112), and feature 106 has the second highest (0.0111).
* Feature 98 has the lowest **R2 score** in this tag (0.0).
 

# Group Tag 25
* This group contains the Tag 25 features, which are built from the 26th shared concept. Note that some features are built from more than one concept.

In [None]:
tag_25_df = sample_df[[*categories['tag_25'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_25_df.columns):
    sns.distplot(tag_25_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 25

In [None]:
corr = tag_25_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 30))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 25
   - Here I look for a linear relationship between each Tag 25 feature and resp.
   - For each feature in Tag 25, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 25 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_25_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 85, 86, 87, 88, 89, 91, 94, 111 and 101 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between feature 90 and resp is found in this group.
* Feature 85 has the highest **R2 score** in this tag (0.016), and feature 91 has the second highest (0.015).
* Feature 92 has the lowest **R2 score** in this tag (0.0).
 

# Group Tag 26
* This group contains the Tag 26 features, which are built from the 27th shared concept. Note that some features are built from more than one concept.

In [None]:
tag_26_df = sample_df[[*categories['tag_26'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_26_df.columns):
    sns.distplot(tag_26_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 26

In [None]:
corr = tag_26_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(25, 35))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 26
   - Here I look for a linear relationship between each Tag 26 feature and resp.
   - For each feature in Tag 26, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 26 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_26_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 10, 111, 112 and 113 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between feature 116 and resp is found in this group.
* Feature 113 has the highest **R2 score** in this tag (0.0074), and feature 109 has the second highest (0.0015).
* Feature 115 has the lowest **R2 score** in this tag (0.0).
 

# Group Tag 27
* This group contains the Tag 27 features, which are built from the 28th shared concept. Note that some features are built from more than one concept.

In [None]:
tag_27_df = sample_df[[*categories['tag_27'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_27_df.columns):
    sns.distplot(tag_27_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 27

In [None]:
corr = tag_27_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(25, 35))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 27
   - Here I look for a linear relationship between each Tag 27 feature and resp.
   - For each feature in Tag 27, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 27 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_27_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* A positive relation between features 73, 74, 75, 76, 79, 81, 80 and 82 and resp (the target variable) can be seen in the **linear regression analysis** above.
* A negative relation between feature 78 and resp is found in this group.
* Feature 79 has the highest **R2 score** in this tag (0.018), and feature 73 has the second highest (0.016).
* Feature 72 has the lowest **R2 score** in this tag (0.0001).
 

# Group Tag 28
* This group contains the Tag 28 features, which are built from the 29th shared concept. Note that some features are built from more than one concept.

In [None]:
tag_28_df = sample_df[[*categories['tag_28'][0]]]

In [None]:
fig, axes = plt.subplots(nrows=6
                         , ncols=2,figsize=(25,45))
# fig = plt.figure(figsize = (25,20))
for i, column in enumerate(tag_28_df.columns):
    sns.distplot(tag_28_df[column],ax=axes[i//2,i%2],color='m')

# Correlation Analysis of Group 28

In [None]:
corr = tag_28_df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 30))

# Draw the heatmap with square cells and a color scale fixed to [-1, 1]
sns.heatmap(corr, cmap='BrBG',  center=0,vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

   # Linear Regression Analysis of Group 28
   - Here I look for a linear relationship between each Tag 28 feature and resp.
   - For each feature in Tag 28, I plot a scatter plot with marginal distributions and fit a linear model.
   - In these charts, the x-axis shows a Tag 28 feature and the y-axis shows resp.
   

In [None]:

ratio=4
f = plt.figure(figsize=(25,60))
outer_grid = gridspec.GridSpec(6, 2, wspace=0.3, hspace=0.3)
for i, column in enumerate(tag_28_df.columns):
    gs = gridspec.GridSpecFromSubplotSpec(ratio+1, ratio+1,
            subplot_spec=outer_grid[i], wspace=0.3, hspace=0.3)
    g = myjoint(y="resp", x=column, data=sample_df, ratio=ratio)
    g = g.plot(sns.regplot, sns.distplot)
#     g = g.plot(sns.scatterplot,sns.distplot)
#     plt.xlabel(f"{column}")
    r2_score = r2(x=sample_df[column],y=sample_df["resp"])
    plt.xlabel(f"{column} R2 score:{round(r2_score,4)}")
# f.tight_layout()


* No positive or negative relation between the features and resp (the target variable) is present in the **linear regression analysis** above.
* Feature 121 has the highest **R2 score** in this tag (0.0121).
* Feature 120 has the lowest **R2 score** in this tag (0.0).
 

# Valuable Insights
These are some insights from the analysis above:

* Features that share a tag are correlated with each other to some degree.
* However, each of the first 5 groups that I explored contains one feature that has no correlation with any other feature in its group.
* Among the first features of a group, the correlation appears in consecutive pairs: for example, features 1 and 2 are strongly correlated, then features 3 and 4 are strongly correlated.
* Looking at the distributions of the features that have no correlation with any other member of their group, they seem to have been created from the same concept, or a combination of the concepts, used for the weight feature, since both have similarly shaped distributions.
* Apart from the features that belong to Tag 0, the first 6 features of the other groups are correlated among themselves and have no strong correlation with any other member. The last 10 features show the same strong correlation pattern among themselves as the first 6.

* Features with a positive relation with resp tend to have a higher R2 score than features with a negative relation. For example, features 30 and 40 have the highest **R2 score** in the groups that I have explored.
* Group Tag 6 has the highest number of features as well as the highest R2 scores of any group that I have explored. **This validates my hypothesis about the high significance of the features in this group.**

* **A positive relation and a higher R2 score between resp and a feature indicate that the feature can better explain the behaviour of resp.**
* **Features 41, 42 and 43 have a different distribution from any other feature. This suggests that they were created from one similar idea/concept.**
* **Features in Tags 17, 18, 19, 20, 21, 22 and 28 have lower R2 scores than any other group that I have explored. This may be due to the high skewness and kurtosis present in these features. Some of the features have a bimodal distribution instead of a normal distribution. These might be the reasons for the low R2 scores of these features.**
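
The skewness and kurtosis mentioned in the last point can be measured directly instead of being judged from the distribution plots. The sketch below uses pandas' `DataFrame.skew` and `DataFrame.kurt` (the latter reports excess kurtosis, which is about 0 for a normal distribution) on synthetic columns. The `demo` frame and its column names are assumptions made for illustration, not features from this dataset, and the numbers say nothing about whether shape actually causes the low R2 scores.

```python
import numpy as np
import pandas as pd

# Synthetic columns with known shapes (assumption: NOT the competition data).
rng = np.random.default_rng(2)
n = 20000
demo = pd.DataFrame({
    "normal_like": rng.normal(size=n),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=n),
    "bimodal": np.concatenate([
        rng.normal(-3.0, 1.0, n // 2),
        rng.normal(3.0, 1.0, n // 2),
    ]),
})

# One row per column: sample skewness and excess kurtosis.
shape = pd.DataFrame({
    "skew": demo.skew(),
    "excess_kurtosis": demo.kurt(),
})
print(shape.round(2))
```

A well-separated bimodal column shows up with clearly negative excess kurtosis, while a heavy right tail shows up as large positive skew. On the notebook's data, `tag_19_df.skew()` and `tag_19_df.kurt()` would presumably give the same kind of summary per feature.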

> Let me know your findings about this data in the comment section below :)

# References
[1] [Stackoverflow - How to plot multiple Seaborn Jointplot in Subplot ?](https://stackoverflow.com/questions/35042255/how-to-plot-multiple-seaborn-jointplot-in-subplot)