In [7]:
%run  Data_processing.ipynb
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from scipy import stats

### Machine learning models used
> * Linear Regression
* Logistic Regression
* Naive Bayes
* Random Forest 
* Support Vector Machines

In [87]:
# Show final data
final.head()

Unnamed: 0,Country,Code,ATMs per 1000 Adults,iPhone Users,Starbucks Locations,GEI Score
0,Switzerland,CH,97.611221,11700000,61,1
1,Canada,CA,221.126457,29390000,1468,1
2,Sweden,SE,40.178787,12639000,18,1
3,Denmark,DK,54.014077,7266000,21,1
4,Australia,AU,160.13778,31770000,22,1


### 1 - Linear Regression

In [45]:
# Subset data into features and labels 
x = final[["ATMs per 1000 Adults","iPhone Users","Starbucks Locations"]]
y = final[["GEI Score"]]

# Standardize the features and the label, each with its own scaler
stdsc = StandardScaler()
x_std = stdsc.fit_transform(x)
y_std = StandardScaler().fit_transform(y)

# Fit ordinary least squares and report R^2 on the same data
model = linear_model.LinearRegression(fit_intercept=True)
model.fit(x_std, y_std)

print("R^2 is {:.1f} %".format(round(model.score(x_std, y_std) * 100)))

R^2 is 25.0 %


### What does that mean?

R² is the proportion of the variance in the target that the fitted regression explains, so 25% is a low score. It suggests that a straight line does not capture the relationship between these features and the GEI score well, so either the variables need to be adjusted or a nonlinear algorithm needs to be used.

GEI scores are calculated by combining several components, one of which is people's attitudes and perceptions towards entrepreneurship. Human psychology is hard to predict, and models of it tend to have a low R². The classification models below are an attempt to get around this.
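
To make the definition of R² concrete, here is a minimal sketch that computes it from the residuals of a least-squares fit. The data are synthetic and the variable names are made up for illustration; nothing here comes from the notebook's dataset.

```python
import numpy as np

# Synthetic data (hypothetical): 52 rows, 3 features, a noisy linear target
rng = np.random.RandomState(0)
X = rng.normal(size=(52, 3))
y = 0.5 * X[:, 0] + rng.normal(size=52)

# Ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
y_hat = A.dot(beta)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("R^2 = {:.3f}".format(r_squared))
```

For a least-squares fit with an intercept, this is the same quantity that `LinearRegression.score` reports.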

In [6]:
import pandas as pd

# Renaming columns
x_std = pd.DataFrame(x_std)
x_std.columns = ["ATMs per 1000 Adults","iPhone Users","Starbucks Locations"]

# OLS regression results using statsmodels
# No constant is added: both sides are standardized, so the intercept is zero
res = sm.OLS(y_std, x_std).fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.247
Model:                            OLS   Adj. R-squared:                  0.201
Method:                 Least Squares   F-statistic:                     5.371
Date:                Sat, 18 Mar 2017   Prob (F-statistic):            0.00281
Time:                        21:58:31   Log-Likelihood:                -66.392
No. Observations:                  52   AIC:                             138.8
Df Residuals:                      49   BIC:                             144.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
ATMs per 1000 Adults     0.4573 

### Interesting Results:

The OLS regression results show that the ATM variable is not statistically significant, even though it has the largest coefficient and the strongest correlation with the label. Some outliers need to be removed to get better results.

It is also interesting that as the number of iPhone users in a country increases, the entrepreneurship index decreases. The usual expectation is that as a country develops, people adopt new technologies and innovate. This project's outcome suggests the opposite.
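
One common way to approach the outlier removal mentioned above is the interquartile-range (IQR) rule. The sketch below is only an illustration: the helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the sample values are assumptions, not part of the original analysis.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical ATM densities; the last value is deliberately extreme
atms = [40.2, 54.0, 61.5, 72.3, 97.6, 88.1, 65.0, 290.0]
mask = iqr_outlier_mask(atms)

print(mask)
```

Rows where the mask is `True` could then be dropped or inspected before refitting the regression.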

### Classification models outline
> * Transform target variable into a binary outcome
* Split data into a training and a testing set
* Train model and evaluate score

### Transform data for binary classification

In [47]:
# Convert the GEI Score into a binary output
final = final.sort_values(["GEI Score"], ascending=False)
final = final.reset_index(drop=True)

# Any country with a GEI score of 42.2 or above is considered a potential country
final['GEI Score'] = (final['GEI Score'] >= 42.2).astype(int)

<img src="pics/GEI.png" width="850">

This map was exported from the GEDI website ([here](http://thegedi.org/countries)). Countries with an index lower than 42.2, colored in orange, are not considered top-ranked. For that reason, countries at or above the threshold were treated as good candidates for expansion and labeled 1, and all other countries were labeled 0.
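
The thresholding rule described above can be checked on a handful of values. The scores below are hypothetical and chosen to include the boundary case of exactly 42.2.

```python
import pandas as pd

# Hypothetical GEI scores, not taken from the notebook's data
scores = pd.Series([80.4, 42.2, 42.1, 15.0], name="GEI Score")
THRESHOLD = 42.2

# 1 for scores at or above the threshold, 0 otherwise
labels = (scores >= THRESHOLD).astype(int)

print(labels.tolist())
```

A score of exactly 42.2 is labeled 1 because the comparison is inclusive.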

In [50]:
final.head()

Unnamed: 0,Country,Code,ATMs per 1000 Adults,iPhone Users,Starbucks Locations,GEI Score
0,Switzerland,CH,97.611221,11700000,61,1
1,Canada,CA,221.126457,29390000,1468,1
2,Sweden,SE,40.178787,12639000,18,1
3,Denmark,DK,54.014077,7266000,21,1
4,Australia,AU,160.13778,31770000,22,1


### Split data into testing and training sets

In [84]:
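from sklearn.model_selection import train_test_split

# Features and binary label (assumed to be the same columns used above)
X = final[["ATMs per 1000 Adults","iPhone Users","Starbucks Locations"]]
y = final["GEI Score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)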
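
With only 52 rows, a random split can leave the two classes unevenly represented in the test set. A possible refinement, sketched here on synthetic data, is to pass `stratify=y` to `train_test_split` so that both sets keep roughly the same class proportions. The 20/32 class counts are an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (hypothetical class balance)
rng = np.random.RandomState(42)
X = rng.normal(size=(52, 3))
y = np.array([1] * 20 + [0] * 32)

# stratify=y keeps the share of each class similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

print("train share of class 1: {:.2f}".format(y_train.mean()))
print("test share of class 1:  {:.2f}".format(y_test.mean()))
```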

### 2 - Logistic Regression

In [86]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print("Logistic Regression's accuracy is {:.1f} %".format(round(logreg.score(X_test, y_test) * 100)))

Logistic Regression's accuracy is 44.0 %


### What does that mean?

- The logistic regression model classified the test data correctly 44% of the time. Although this score still has to be compared with the other models, it's a low score for a supervised learning algorithm on a binary target.

- More adjustments would need to be made to the model, for example removing outliers, feature scaling, or normalization.
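
As an illustration of the feature-scaling adjustment listed above, the sketch below chains `StandardScaler` and `LogisticRegression` in a scikit-learn pipeline, so the scaler is fitted on the training rows only. The data are synthetic: one feature is on a unit scale and the other is in the millions, loosely mimicking the gap between the ATM and iPhone columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two features on very different scales
rng = np.random.RandomState(0)
n = 200
small = rng.normal(size=n)
large = rng.normal(loc=1e7, scale=5e6, size=n)
X = np.column_stack([small, large])
y = (small + (large - 1e7) / 5e6 > 0).astype(int)

# The pipeline standardizes the features, then fits the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X[:150], y[:150])
acc = pipe.score(X[150:], y[150:])

print("accuracy with scaling: {:.2f}".format(acc))
```

Whether scaling helps on the actual GEI data would have to be checked by rerunning the notebook.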

### 3 - Naive Bayes
**Gaussian** Naive Bayes was used as opposed to **Bernoulli** because the features are continuous variables. Bernoulli would have been the choice if the features were binary.

In [83]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print("Naive Bayes' Accuracy is {:.1f} %".format(round(gnb.score(X_test, y_test) * 100)))

Naive Bayes' Accuracy is 44.0 %
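
The choice between the Gaussian and Bernoulli variants can be illustrated on synthetic data: `GaussianNB` on continuous features and `BernoulliNB` on 0/1 features. All values below are generated for the example and are not related to the GEI dataset.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.RandomState(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: the class mean shifts from 0 to 3
X_cont = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(100, 2))

# Binary features: the probability of a 1 is 0.2 for class 0 and 0.8 for class 1
X_bin = (rng.uniform(size=(100, 2)) < (0.2 + 0.6 * y[:, None])).astype(int)

# Training accuracy of each variant on the feature type it is designed for
gnb_acc = GaussianNB().fit(X_cont, y).score(X_cont, y)
bnb_acc = BernoulliNB().fit(X_bin, y).score(X_bin, y)

print("GaussianNB on continuous features: {:.2f}".format(gnb_acc))
print("BernoulliNB on binary features:    {:.2f}".format(bnb_acc))
```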


### 4 - Random Forest

In [78]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=2)
rf.fit(X_train, y_train)

print("Random Forest's Accuracy {:.1f} %".format(round(rf.score(X_test, y_test) * 100)))

Random Forest's Accuracy 61.0 %
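
A random forest is randomized, so without a fixed seed the accuracy can change from run to run. The sketch below, on synthetic data, shows that setting `random_state` makes two fits agree, and reads `feature_importances_` as a rough view of which features the forest relies on. The seed value and `n_estimators=50` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): only the first feature carries signal
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

# Two forests with the same seed are trained on the same rows
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X[:40], y[:40])
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X[:40], y[:40])

# Identical seeds give identical predictions on the held-out rows
same = np.array_equal(rf_a.predict(X[40:]), rf_b.predict(X[40:]))
importances = rf_a.feature_importances_

print("same predictions:", same)
print("feature importances:", np.round(importances, 2))
```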


### 5 - Support Vector Machines

In [70]:
from sklearn.svm import SVC

# Use a distinct name so the sklearn svm module is not shadowed
svc = SVC()
svc.fit(X_train, y_train)

print("SVM's Accuracy {} %".format(round(svc.score(X_test, y_test) * 100, 2)))

SVM's Accuracy 44.44 %
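
To put the accuracies above in context, it can help to compare them with a trivial baseline that always predicts the most frequent training label. The helper `majority_baseline_accuracy` and the labels below are hypothetical; this is a sketch, not a result from the notebook.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))

# Hypothetical labels, for illustration only
y_train = [1, 0, 0, 1, 0, 0, 1, 0]
y_test = [0, 1, 0, 0]
baseline = majority_baseline_accuracy(y_train, y_test)

print("baseline accuracy: {:.0f} %".format(baseline * 100))
```

A model that scores below this baseline on the same test set is not yet extracting useful signal from the features.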
