# Women Entrepreneurship EDA & Regression
Hello everyone! This notebook performs Exploratory Data Analysis (EDA) on the features of the dataset and builds regressors that predict the 'Entrepreneurship Index'.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import geopandas as gpd
import plotly.express as px
import matplotlib.pyplot as plt
from scipy.stats import boxcox
from collections import Counter
from xgboost import XGBRegressor
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import SGDRegressor, Lasso, ElasticNet, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Preparing the dataset
The first step we will take is preparing the dataset so that it can be visualised.

In [None]:
df = pd.read_csv('../input/women-entrepreneurship-and-labor-force/Dataset3.csv')

In [None]:
df.head()

As we can see, the data arrives in an awkward form: all of the columns are packed into a single semicolon-separated column. Therefore, we will separate it into multiple features.

In [None]:
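# Split each semicolon-separated row into its own fields.
# DataFrame.append was removed in pandas 2.0, so the rows are collected and concatenated once.
rows = [pd.Series(str(row[0]).split(';')) for row in np.array(df)]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)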

In [None]:
df = df[51:].drop('No;Country;Level of development;European Union Membership;Currency;Women Entrepreneurship Index;Entrepreneurship Index;Inflation rate;Female Labor Force Participation Rate', axis=1)
df = df.reset_index(drop=True)

In [None]:
df.columns = ['No', 'Country', 'Level of development', 'European Union Membership', 'Currency', 
'Women Entrepreneurship Index', 'Entrepreneurship Index', 'Inflation rate', 
'Female Labor Force Participation Rate']
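
The manual split above works, but pandas can also parse the delimiter itself. The following is a minimal sketch, assuming the file is a plain semicolon-separated CSV; the two data rows are made-up placeholders, not values from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample that reuses the dataset's header layout.
sample = io.StringIO(
    "No;Country;Level of development;European Union Membership;Currency;"
    "Women Entrepreneurship Index;Entrepreneurship Index;Inflation rate;"
    "Female Labor Force Participation Rate\n"
    "1;Country A;Developed;Member;Euro;50.0;60.0;1.0;55.0\n"
    "2;Country B;Developing;Not Member;National Currency;30.0;25.0;5.0;45.0\n"
)

# sep=';' tells read_csv to split on semicolons instead of commas.
parsed = pd.read_csv(sample, sep=';')
print(parsed.dtypes)
```

With this approach the numerical columns are typically inferred as numbers straight away, so no later conversion is needed.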

In [None]:
df['Entrepreneurship Index'] = df['Entrepreneurship Index'].astype(float)
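# Use the abbreviated name so that it matches the country name in the map data.
# .loc avoids chained assignment, which pandas may apply to a temporary copy.
df.loc[df['Country'] == 'Bosnia and Herzegovina', 'Country'] = 'Bosnia and Herz.'
df = df[df['Country'] != 'Singapore'].reset_index(drop=True)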

The numerical columns are converted from object to float so that our program can interpret them as numerical.

In [None]:
for col in ['Women Entrepreneurship Index', 'Entrepreneurship Index', 'Inflation rate', 'Female Labor Force Participation Rate']:
    df[col] = df[col].astype(float)
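
`astype(float)` raises an error if any entry cannot be parsed as a number. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` converts such entries to NaN so they can be inspected. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical string column in which one entry is not a valid number.
raw = pd.Series(['52.5', '38.1', 'n/a'])

# errors='coerce' turns unparseable entries into NaN instead of raising.
clean = pd.to_numeric(raw, errors='coerce')
print(clean.isna().sum())
```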

In [None]:
df

# Mapping the data

Our first visualisations plot each feature on a world map, shading every country according to its value in that column.

The following 'mapped' subroutine uses Geopandas to create a map and show the value that each country has per feature.

In [None]:
def mapped(column):
    # Note: gpd.datasets was removed in GeoPandas 1.0, so this call needs an older version.
    world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
    world.index = world['name']
    world = world.reindex(df['Country'])
    world.index = range(len(world))
    world[column] = df[column]
    world = world.fillna(0)

    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.1)
    ax.set_xlim(-130, 190)

    ax.set_title(column)
    world.plot(column=column, legend=True, cax=cax, ax=ax, cmap='OrRd', edgecolor='black')
    plt.show()
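
Because `reindex` matches country names exactly, a country whose spelling differs between the dataset and the map ends up without a geometry. A quick way to spot such mismatches is sketched below, using made-up name lists as stand-ins for `df['Country']` and `world['name']`.

```python
import pandas as pd

# Hypothetical name lists standing in for df['Country'] and world['name'].
dataset_names = pd.Index(['Austria', 'Bosnia and Herzegovina', 'Singapore'])
map_names = pd.Index(['Austria', 'Bosnia and Herz.', 'Belgium'])

# Names that appear in the dataset but have no exact match in the map.
unmatched = dataset_names.difference(map_names)
print(list(unmatched))
```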

## Entrepreneurship Index
The map below shows that, out of our selected countries, **Australia** has the highest Entrepreneurship Index, along with **Iceland** and certain European countries such as **Sweden, Finland and Ireland**. The countries with the lowest Entrepreneurship Index are **Russia, Thailand, India, Egypt, Algeria, Ghana, Brazil, Mexico and Bolivia**.

In [None]:
mapped('Entrepreneurship Index')

## Women Entrepreneurship Index
The 'Entrepreneurship Index' and 'Women Entrepreneurship Index' features are closely related: countries which do well in one column also do well in the other.

In [None]:
mapped('Women Entrepreneurship Index')

## Inflation Rate
The map shown below tells us that **Argentina** has the highest inflation rate, followed by **Ghana** and then **Russia**. The other countries all have a relatively low inflation rate.

In [None]:
mapped('Inflation rate')

## Female Labour Force Participation Rate
The Female Labour Force Participation Rate feature has a noticeably higher average than the other features. Countries in Asia, such as **Russia, China, India, Japan and Thailand**, have the highest rate, along with **Australia**. Several Latin American countries (**Mexico, Brazil, Peru, Argentina, Bolivia and Uruguay**) have a moderate Female Labour Force Participation Rate.

In [None]:
mapped('Female Labor Force Participation Rate')

# Plotting binary data
The next step will be using bar charts to graph our binary 'Level of development', 'European Union Membership' and 'Currency' features.

In [None]:
def bar_charts(col, title, x, y, colour='blue'):
    count = Counter(df[col])
    bars = plt.bar(count.keys(), count.values(), color=colour)

    for bar in bars:
        score = list(count.values())[bars.index(bar)]
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height, score, ha='center', va='bottom')

    plt.title(title)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.show()
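
The `Counter` used above tallies how often each category occurs. The same counts are available from pandas via `value_counts`, shown here on a hypothetical column rather than the real data.

```python
import pandas as pd

# Hypothetical categorical column.
levels = pd.Series(['Developed', 'Developing', 'Developed', 'Developed'])

# value_counts returns the categories sorted by frequency, most common first.
counts = levels.value_counts()
print(counts)
```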

## Level of development
The numbers of developed and developing countries are similar, with slightly more developed ones.

In [None]:
bar_charts('Level of development', 'Development of Countries', 'Development', 'Number of Countries')

## European Union Membership
There are a lot more countries that are not members of the EU than those that are. The number of members is about two thirds of the number of non-members.

In [None]:
bar_charts('European Union Membership', 'European Member', 'Member or not', 'Number of Countries', 'red')

## Currency
The number of countries that use the Euro is less than half of the number that use their own national currency.

In [None]:
bar_charts('Currency', 'Currency for Countries', 'Type of Currency', 'Number of Countries', 'green')

# Correlation
Now we will check how strongly the features in our dataset correlate with each other.

The heatmap below shows that the most correlated pair is "Entrepreneurship Index" and "Women Entrepreneurship Index", with a correlation coefficient of 0.91. The next strongest pair is "Female Labor Force Participation Rate" and "Women Entrepreneurship Index", with 0.44.

In [None]:
numerical = df[['Women Entrepreneurship Index', 'Entrepreneurship Index', 'Inflation rate', 'Female Labor Force Participation Rate']]
sns.heatmap(numerical.corr(), annot=True)
plt.show()
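
By default, `DataFrame.corr()` computes the Pearson correlation coefficient. As a small worked example on hypothetical values, the coefficient can be reproduced by hand from the centred columns.

```python
import numpy as np
import pandas as pd

# Hypothetical, roughly linear data.
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                     'b': [2.1, 3.9, 6.2, 8.1, 9.8]})

# Pearson r: covariance of the centred columns divided by the product of their spreads.
a_c = demo['a'] - demo['a'].mean()
b_c = demo['b'] - demo['b'].mean()
r_manual = (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_pandas = demo.corr().loc['a', 'b']
print(r_manual, r_pandas)
```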

We will create a subroutine called 'scatter' which draws a seaborn regplot: a scatter plot of two features together with a fitted regression line.

In [None]:
def scatter(col1, col2, marker, colour):
    sns.regplot(data=df, x=col1, y=col2, marker=marker, color=colour)
    plt.title('Correlation')
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()
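
By default, the line drawn by `regplot` is a least-squares straight line. If the slope and intercept themselves are of interest, a comparable fit can be obtained with `np.polyfit`. The data below is hypothetical and exactly linear, purely for illustration.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 5.
x = np.array([10.0, 20.0, 30.0, 40.0])
y = 2.0 * x + 5.0

# deg=1 fits a straight line; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```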

## Women Entrepreneurship Index & Entrepreneurship Index
These features have a very strong positive correlation which can be seen in the first scatter graph.

In [None]:
scatter('Women Entrepreneurship Index', 'Entrepreneurship Index', 's', 'blue')

## Women Entrepreneurship Index & Female Labor Force Participation Rate
These columns have a weaker association, but some positive relationship is still visible in the following plot.

In [None]:
scatter('Women Entrepreneurship Index', 'Female Labor Force Participation Rate', 'x', 'green')

## Women Entrepreneurship Index & Inflation rate
The variables have a negative correlation to one another.

In [None]:
scatter('Women Entrepreneurship Index', 'Inflation rate', 'o', 'pink')

## Female Labor Force Participation Rate & Inflation rate
Most of the data points are grouped together in the lower-right corner of the plot, so I wouldn't consider there to be a strong connection here.

In [None]:
scatter('Female Labor Force Participation Rate', 'Inflation rate', 'p', 'lightblue')

## Entrepreneurship Index & Inflation rate
The Entrepreneurship Index and Inflation rate have a somewhat negative association with each other.

In [None]:
scatter('Entrepreneurship Index', 'Inflation rate', 'h', 'orange')

## Entrepreneurship Index & Female Labor Force Participation Rate
These features have a positive correlation with each other, although there are a lot of outliers in the lower part of the plot.

In [None]:
scatter('Entrepreneurship Index', 'Female Labor Force Participation Rate', 'v', 'purple')

# Transforming the data
In this section, we look at how the distributions of the different features change when they are transformed with the log, Box-Cox, standard scaler and min-max scaler transformations.

In [None]:
for col in numerical:
    fig, axes = plt.subplots(1, 5, figsize=(15, 3))
    
    f1 = df[col]
    f2 = (df[col]+3).transform(np.log)
    f3 = pd.DataFrame(boxcox(df[col]+3)[0])
    f4 = pd.DataFrame(StandardScaler().fit_transform(np.array(df[col]).reshape(-1, 1)))
    f5 = pd.DataFrame(MinMaxScaler().fit_transform(np.array(df[col]).reshape(-1, 1)))
    
    for column in [[f1, axes[0], 'pink', 'Normal'], [f2, axes[1], 'lightblue', 'Log Transform'], 
                    [f3, axes[2], 'lightgreen', 'Box Cox'], [f4, axes[3], 'orange', 'Standard Scaler'], 
                    [f5, axes[4], 'skyblue', 'Min Max Scaler']]:
        feature = column[0]
        ax = column[1]
        colour = column[2]
        name = column[3]
        
        feature.hist(ax=ax, color=colour)
        ax.set_xlabel(name)
        
        deciles = feature.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
        for pos in np.array(deciles).reshape(1, -1)[0]:
            handle = ax.axvline(pos, color='darkblue', linewidth=1)
        ax.legend([handle], ['decile'])
        
    axes[2].set_title(col)
    axes[3].set_title('')
    axes[4].set_title('')
    
    plt.show()
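
When reading these histograms, it helps to remember that the standard and min-max scalers are linear rescalings: they change the axis but not the shape of the distribution. The log and Box-Cox transforms are nonlinear and can change the shape. The sketch below checks the formulas on hypothetical values; it also shows that Box-Cox with a fixed `lmbda=0` reduces to the natural log.

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical positive values.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
col = x.reshape(-1, 1)

standard = StandardScaler().fit_transform(col).ravel()
minmax = MinMaxScaler().fit_transform(col).ravel()

# The same results by hand (StandardScaler uses the population standard deviation).
standard_manual = (x - x.mean()) / x.std()
minmax_manual = (x - x.min()) / (x.max() - x.min())

# With lmbda given, boxcox returns only the transformed array.
boxcox_log = boxcox(x, lmbda=0)
print(np.allclose(standard, standard_manual), np.allclose(minmax, minmax_manual))
```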

# Predicting the data
The final part of this notebook will be to create regressors that can predict our 'Entrepreneurship Index' column.

## X and y
We first set X to our 'Women Entrepreneurship Index' feature and y to our 'Entrepreneurship Index'. We only use one column for X, as it has a very strong correlation with y.

Afterwards, we split the X and y into train and test sets.

In [None]:
X = np.array(df['Women Entrepreneurship Index']).reshape(-1, 1)
y = np.array(df['Entrepreneurship Index']).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
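
`train_test_split` shuffles the rows randomly, so each run produces a different split unless a seed is fixed. A minimal sketch on hypothetical arrays follows; the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with the same column shape as X and y above.
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo

# The same random_state gives the same split on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```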

## Creating the predictors
Next, we try the XGBoost, Random Forest, Lasso, Elastic Net and Ridge regressors and use the model score, R2 score, mean absolute error and mean squared error to see how each predictor performs on our X and y.

In [None]:
models = [['XGBoost', XGBRegressor()], ['Random Forest', RandomForestRegressor()], ['Lasso', Lasso()], 
          ['Elastic Net', ElasticNet()], ['Ridge', Ridge()]]

model_score = []
r2_scores = []
mae_list = []
mse_list = []

for name, model in models:
    
    # Fit on the training split only, so that the test split stays unseen for evaluation
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    
    score = model.score(X_test, y_test)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    model_score.append(score)
    r2_scores.append(r2)
    mae_list.append(mae)
    mse_list.append(mse)

    print(name)
    print('Model score:        ', score)
    print('R2 score:           ', r2)
    print('Mean absolute error:', mae)
    print('Mean squared error: ', mse)
    
    if model != models[-1][1]:
        print('')
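
For reference, the three metrics can be computed by hand. The sketch below uses hypothetical predictions and compares the results with scikit-learn. Note that the `score` method of a scikit-learn regressor returns R2, so 'Model score' and 'R2 score' report the same quantity.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_hat
mae = np.abs(errors).mean()    # mean absolute error
mse = (errors ** 2).mean()     # mean squared error
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - (errors ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

print(mae, mse, r2)
```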

## Visualising the performances
In my run, XGBoost had the best scores, followed by Random Forest, while Lasso, Elastic Net and Ridge scored roughly the same as one another. The train/test split is random, so the exact values vary between runs.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

scores = [model_score, r2_scores, mae_list, mse_list]
names = ['XGBoost', 'Random Forest', 'Lasso', 'Elastic Net', 'Ridge']
score_names = ['Model Score', 'R2 Score', 'Mean Absolute Error', 'Mean Squared Error']
colours = ['red', 'blue', 'green', 'yellow', 'orange']
i = 0

for row in axes:
    for ax in row:
        bars = ax.bar(names, scores[i], color=colours[i])
        
        # Label each bar with its value, rounded to two decimals
        for bar, value in zip(bars, scores[i]):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height, f'{value:.2f}', ha='center', va='bottom')
        
        ax.set_title(score_names[i])
        i += 1
        
plt.show()
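
These scores come from a single random split, and with roughly 50 countries the test set holds only about ten rows, so the numbers can swing between runs. Cross-validation, for which `cross_val_score` is already imported at the top of the notebook, averages over several splits. The sketch below uses synthetic data as a stand-in for X and y; the data, the choice of Ridge and `cv=5` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: one feature with a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(20, 80, size=(50, 1))
y_demo = 0.9 * X_demo.ravel() + rng.normal(0, 5, size=50)

# Five folds: each row is used for testing exactly once.
cv_scores = cross_val_score(Ridge(), X_demo, y_demo, cv=5, scoring='r2')
print(cv_scores.mean(), cv_scores.std())
```

The spread of the fold scores gives a rough idea of how much a single-split result should be trusted.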

### Thank you for reading my notebook.

### If you enjoyed this notebook and found it helpful, please give it an upvote and provide feedback as it would help me make more of these.