In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



In [3]:
#Importing the Starbucks store-locations dataset
Data_ST = pd.read_csv("Data/starbucks.csv")
#Keeping only the country column (it holds country codes)
Data_ST = pd.DataFrame(Data_ST.loc[:, 'Country'])
#Calculating the number of locations per country
Data_ST['freq'] = Data_ST.groupby('Country')['Country'].transform('count')


In [4]:
Data_ST = Data_ST.drop_duplicates()
Data_ST = Data_ST.reset_index(drop=True)
#Renaming columns
Data_ST.columns = ['Code', 'Starbucks locations']
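
The count, de-duplicate and rename steps above can also be sketched in one pass with `value_counts`. This is only an illustration: the toy country codes below are made up and are not taken from the Starbucks file.

```python
import pandas as pd

# Toy stand-in for the 'Country' column of the Starbucks file (hypothetical values).
stores = pd.DataFrame({"Country": ["US", "US", "CA", "US", "CA", "AE"]})

# value_counts yields one entry per distinct code with its number of rows.
counts = stores["Country"].value_counts()

# Build the two-column table used for the later merge on 'Code'.
locations = pd.DataFrame({
    "Code": counts.index,
    "Starbucks locations": counts.to_numpy(),
})
print(locations)
```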

### Why Starbucks as a feature?

- Believe it or not, Starbucks reflects a country's welfare. When you see many Starbucks locations in a poor country whose people are traditionally tea drinkers, that tells you something about the country: maybe it isn't as poor as we thought, or maybe its demographics are changing.

- Merchants who use fintech are mostly coffee shops and restaurants, and Starbucks is the third largest chain by number of locations, so as the number of Starbucks locations increases, so does the competition.

- Starbucks sounds like a cool variable, but more importantly, it's easy to scale: just use Google Maps, for example.






In [5]:
#importing the country code dataset
Data_code = pd.read_csv('https://raw.githubusercontent.com/datasets/country-list/master/data.csv')
#importing the GEI index dataset
Data_gei = pd.read_csv('Data/gei.csv')
#Renaming columns
Data_code.columns = ["Country", 'Code']
Data_gei.columns = ['Rank', 'Country', "GEI Score"]
Data_gei.head()

Unnamed: 0,Rank,Country,GEI Score
0,1.0,United States,83.4
1,2.0,Switzerland,78.0
2,3.0,Canada,75.6
3,4.0,Sweden,75.5
4,5.0,Denmark,74.1


### What is the GEI Score?

- The GEI score is the model's label. GEI stands for the Global Entrepreneurship Index, an annual index that measures the health of the entrepreneurship ecosystem in each of 137 countries and then ranks their performance against each other. The Global Entrepreneurship and Development Institute is responsible for the index. Check it out [here](https://thegedi.org).




In [6]:
#Merging datasets
Data_label = pd.merge(Data_code,Data_gei, on="Country", how="inner" )
final = pd.merge(Data_ST,Data_label, on="Code", how="inner" )
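
An inner merge silently drops any country whose name is spelled differently in the two tables. A minimal sketch of how one might list those rows, using `indicator=True` on an outer merge. The miniature tables are hypothetical: the spelling mismatch and the third score are placeholders chosen for illustration, not values from the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the country-code table and the GEI table.
codes = pd.DataFrame({
    "Country": ["United States", "Canada", "Russian Federation"],
    "Code": ["US", "CA", "RU"],
})
gei = pd.DataFrame({
    "Country": ["United States", "Canada", "Russia"],
    "GEI Score": [83.4, 75.6, 0.0],  # last score is a placeholder
})

# The outer merge keeps every row; the '_merge' column says where each was found.
check = pd.merge(codes, gei, on="Country", how="outer", indicator=True)

# Rows that an inner merge on 'Country' would drop.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country", "_merge"]])
```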



In [8]:
#importing the Mobile dataset
data_mobile = pd.read_csv('Data/mobile.csv')
data_mobile.columns = ["Country", "Total Subscribers"]
#27% is Apple's market share according to Gartner's vendor rating report on Samsung
#Note: this product is not assigned back, so the column keeps the raw subscriber counts
data_mobile.loc[:,'Total Subscribers'] * 0.27
final = pd.merge(data_mobile,final, on="Country", how="inner" )
final.rename(columns={'Total Subscribers': 'iPhone Users', 'Starbucks locations': 'Starbucks Locations'}, inplace=True)

### iPhone Users

- iPhone users were calculated by multiplying Apple's global market share by the number of mobile cellular telephone subscribers per country. Apple's market share comes from Gartner's vendor rating report on Samsung, and the mobile subscriber data was exported from the CIA's World Factbook.
- Many consider the iPhone the main tool for FinTech penetration. Mobile applications used to be seen as a supporting marketing tool for the core business model, but nowadays the app is the business model, as companies like Venmo, Square and Stripe show.
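
The calculation described in the first bullet can be sketched as follows. The subscriber counts are round toy numbers, and the `APPLE_SHARE` constant and the `mobile` frame are names introduced here for illustration only.

```python
import pandas as pd

APPLE_SHARE = 0.27  # market share figure quoted in the text

# Toy subscriber counts (hypothetical, not from the World Factbook export).
mobile = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Total Subscribers": [1_000_000, 2_000_000],
})

# Store the product in a column; a bare `column * 0.27` expression
# computes the values and then discards them.
mobile["iPhone Users"] = mobile["Total Subscribers"] * APPLE_SHARE
print(mobile)
```

Multiplying every country by the same constant rescales the column but leaves its Pearson correlation with the other columns unchanged.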

In [9]:
#importing the ATM dataset
data_atm = pd.read_csv("Data/ATMs.csv")
data_atm.columns = ["Country", "ATMs per 1000 Adults"]
data_atm = data_atm.dropna()
data_atm = data_atm.reset_index(drop=True)


### ATMs per 1000 Adults, Really?

Yes, the ATMs feature was selected for the following reasons:
- Of the three features, it has the strongest correlation with the GEI score, the label.
- It is the primary indicator of financial inclusion, and therefore of the potential for FinTech companies.
- Mobile banking and mobile payments could come as the next step once the ATM mentality has been established in a country.

The dataset was exported from the World Bank Open Data project.
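
One way to check the correlation claim is to rank the features by the absolute value of their correlation with the label. The sketch below uses only the ten rows displayed from the merged table in this notebook, so its numbers will differ from the full 52-row result; the `sample`, `with_label` and `ranked` names are introduced here for illustration.

```python
import pandas as pd

# The ten rows displayed from the merged table in this notebook (not the full 52).
sample = pd.DataFrame({
    "ATMs per 1000 Adults": [61.909397, 58.841633, 160.13778, 119.094724, 35.082586,
                             93.653161, 90.889727, 114.793823, 221.126457, 97.611221],
    "iPhone Users": [17943000, 60664000, 31770000, 13471000, 10697000,
                     12938000, 9195000, 257814000, 29390000, 11700000],
    "Starbucks Locations": [144, 108, 22, 18, 4, 19, 5, 102, 1468, 61],
    "GEI Score": [58.8, 22.2, 72.5, 63.5, 31.1, 63.0, 22.7, 20.1, 75.6, 78.0],
})

# Pearson correlation of each feature with the label.
with_label = sample.corr()["GEI Score"].drop("GEI Score")

# Reorder so the strongest absolute correlation comes first.
order = with_label.abs().sort_values(ascending=False).index
ranked = with_label.reindex(order)
print(ranked)
```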


In [10]:
final = pd.merge(data_atm, final, on="Country", how="inner" )
final = final.dropna()
final = final.reset_index(drop=True)
final.rename(columns={'ATMs per 1000 Adults_x': 'ATMs per 1000 Adults', 'Total Subscribers': 'iPhone Users'}, inplace=True)
col = ["Country", "Code", "ATMs per 1000 Adults", "iPhone Users", "Starbucks Locations", "GEI Score" ]
final = final[col]
final

Unnamed: 0,Country,Code,ATMs per 1000 Adults,iPhone Users,Starbucks Locations,GEI Score
0,United Arab Emirates,AE,61.909397,17943000,144,58.8
1,Argentina,AR,58.841633,60664000,108,22.2
2,Australia,AU,160.13778,31770000,22,72.5
3,Austria,AT,119.094724,13471000,18,63.5
4,Azerbaijan,AZ,35.082586,10697000,4,31.1
5,Belgium,BE,93.653161,12938000,19,63.0
6,Bulgaria,BG,90.889727,9195000,5,22.7
7,Brazil,BR,114.793823,257814000,102,20.1
8,Canada,CA,221.126457,29390000,1468,75.6
9,Switzerland,CH,97.611221,11700000,61,78.0


In [11]:
final.describe()

Unnamed: 0,ATMs per 1000 Adults,iPhone Users,Starbucks Locations,GEI Score
count,52.0,52.0,52.0,52.0
mean,71.338961,62046880.0,128.653846,43.642308
std,41.008623,147917800.0,271.050164,18.342037
min,10.877094,807000.0,2.0,16.5
25%,48.42597,9299250.0,11.0,27.025
50%,59.448228,19465000.0,28.0,40.35
75%,91.580586,56960250.0,106.5,58.8
max,221.126457,1011054000.0,1468.0,78.0


In [12]:
#Correlating the numeric columns only (Country and Code are text)
final.select_dtypes('number').corr()

Unnamed: 0,ATMs per 1000 Adults,iPhone Users,Starbucks Locations,GEI Score
ATMs per 1000 Adults,1.0,-0.126585,0.480289,0.456772
iPhone Users,-0.126585,1.0,0.121793,-0.248659
Starbucks Locations,0.480289,0.121793,1.0,0.147356
GEI Score,0.456772,-0.248659,0.147356,1.0
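
How much weight can coefficients of this size bear with only 52 rows? A rough sketch using the standard t statistic for a Pearson correlation, `t = r * sqrt(n - 2) / sqrt(1 - r^2)`. The threshold of about 2.01 is the usual two-sided 5% critical value for 50 degrees of freedom. The test assumes roughly normal data, which heavily skewed columns may not satisfy, so treat the result as indicative only. The `corr_t_stat` helper is a name introduced here for illustration.

```python
import math

def corr_t_stat(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

N = 52         # row count reported by final.describe()
T_CRIT = 2.01  # approx. two-sided 5% critical value for 50 degrees of freedom

# Correlations with the GEI score, copied from the matrix above.
for name, r in [("ATMs per 1000 Adults", 0.456772),
                ("iPhone Users", -0.248659),
                ("Starbucks Locations", 0.147356)]:
    t = corr_t_stat(r, N)
    print(f"{name}: t = {t:.2f}, exceeds threshold: {abs(t) > T_CRIT}")
```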


### Starbucks correlation

The correlation between Starbucks locations and the GEI score is extremely low, only 0.147, which is very weird. Does Starbucks know that?
Thirty minutes later, I decided to email the Starbucks data science team to see what they think. That was only possible because I have access to ZoomInfo, a subscription-based contact information service.

This was the email:

Hi Mr. XXX,

 I am a student at George Washington University.

I got your info from Linkedin. I am building a machine learning algorithm where I am using the following features:
Starbucks locations per country
Entrepreneurship Index per country
Today, I came up with a very interesting finding. The correlation between the two features is exactly 0.14. Can you imagine !!!

What do you think about that? and what are the factors that you look for when you expand internationally then if it's not entrepreneurship? 

Cheers


I sent the email to almost 20 employees and didn't get a single reply. I guess I'll have to email them again with the model's findings. Let's move on to that.
