# INM430 - Tiny DS Project Progress Report


**Student Name:** Zacharias Detorakis

**Project Title:** Impact of the global economic crisis on migration to the UK

## Part-1: Data source and domain description (maximum 150 words):

The scope of this analysis is to investigate the effects of the global economic crisis on international migration in recent years. Although the long-term effects may need a few more years before they can be properly studied, there is now enough evidence to support a short-term analysis. Two separate datasets will be used to support this analysis:
- One from the World Bank [ http://datatopics.worldbank.org/world-development-indicators/ ] that has several indices representing the financial status of a country. These data span about six decades (1960 to 2019) and comprise a plethora of indicators that need to be assessed for their relevance.
- One from the Department for Work and Pensions [ https://data.london.gov.uk/dataset/national-insurance-number-registrations-overseas-nationals-borough ] which contains data on the NI numbers issued to overseas nationals for the financial years 2002/03 to 2018/19. The UK is used here as a benchmark as it attracts professionals from all over the globe for several reasons.

***

## Part-2: Analysis Strategy and Plans (maximum 200 words):

Initial investigation: understand which of the reported attributes are relevant to the analytical questions that we need to answer. There are numerous columns, so we may need to reduce some of those dimensions (perhaps using a correlation technique to identify dependencies within the data). It may also help to gather additional data on the distance between each country and the UK, as distance might be an independent factor that affects migration regardless of the crisis, so it could be beneficial to include it in the model.
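One possible way to use correlation for dimension reduction is sketched below. It is only an illustration on toy data: the column names, the `drop_highly_correlated` helper and the 0.9 threshold are assumptions made for the example, not choices already made for this project.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from every pair whose absolute Pearson correlation exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Toy data with hypothetical indicator columns (not taken from the WDI file).
rng = np.random.default_rng(0)
gdp = rng.normal(size=50)
toy = pd.DataFrame({
    "gdp_per_capita": gdp,
    "gni_per_capita": gdp * 1.1 + rng.normal(scale=0.01, size=50),  # near-duplicate of gdp
    "unemployment": rng.normal(size=50),
})

reduced = drop_highly_correlated(toy)
print(list(reduced.columns))
```

Of each highly correlated pair, the helper keeps the column that appears first, so the column order decides which indicator survives.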

Assess the data quality and ensure that there is enough data in both datasets to cover the period in question. Identify a way to merge the data (i.e. common keys). Because this is time-series data and we want to measure the effect of a trend, we may need to derive the increase or decrease over a period. Perhaps also normalise the data by the population of each country so that all data are on the same scale.
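The reshaping, trend and normalisation steps could look roughly like the sketch below. It uses a made-up frame in the same wide layout as `WDIData.csv`; the indicator labels `GDP` and `Population` and the derived column names are placeholders for illustration, not actual WDI indicator names.

```python
import pandas as pd

# Toy frame in the same wide layout as WDIData.csv; the values are made up.
wide = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Indicator Name": ["GDP", "Population", "GDP", "Population"],
    "2007": [100.0, 10.0, 400.0, 20.0],
    "2008": [90.0, 10.0, 440.0, 22.0],
})

# Wide -> long: one row per country, indicator and year.
long = wide.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name"],
    var_name="Year",
    value_name="Value",
)
long["Year"] = long["Year"].astype(int)

# Long -> one column per indicator, indexed by country and year.
panel = long.pivot_table(
    index=["Country Code", "Year"], columns="Indicator Name", values="Value"
).sort_index()

# Scale by population, then take the year-on-year relative change per country.
panel["gdp_per_head"] = panel["GDP"] / panel["Population"]
panel["gdp_per_head_change"] = (
    panel.groupby(level="Country Code")["gdp_per_head"].pct_change()
)
print(panel)
```

Indexing by country code and year would also give a natural key for a later merge with the NI number data, provided that dataset can be mapped to the same country codes.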

I’ve not yet decided on the data analysis algorithms as I am expecting some of that to be informed by the feature engineering phase and/or additional knowledge gained in later lectures for the course.

...
***

## Part-3: Initial investigations on the data sources (maximum 150 words): 

... Insert your text here ...

***

## Part-4: Python code for initial investigations

In [None]:
# The following is essential to get your matplotlib plots inline, so do not miss this one if you have graphics.
%matplotlib inline

# Import Libraries
import csv
import numpy as np
import pandas as pd

In [5]:
# This cell is where you can copy + paste your Python code which loads your data and produces 
# When you press CTRL+Enter, this cell will execute and produce some output
# You can develop your code in Spyder (or another IDE) and copy + paste over here

# Step-1: Load your data

# Step-2: Get an overview of the data features, some suggestions to look for:
#         number of features, data types, any missing values?, 
#         any transformations needed?

# Start with your import (s) here.

# Continue here with your code

# melt()

wdi_data = pd.read_csv("WDIData.csv")
nin_data = pd.read_csv("national-insurance-number-registrations.csv",encoding='latin-1')

In [6]:
indicators = wdi_data["Indicator Name"].unique()
print("Number of Indicators: {}".format(len(indicators)))

countries = wdi_data["Country Name"].unique()
print("Number of Countries: {}".format(len(countries)))

print(wdi_data.columns)

Number of Indicators: 1431
Number of Countries: 264
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', 'Unnamed: 64'],
      dtype='object')


In [7]:
nin_data.Year.unique()

array(['2002/03', '2003/04', '2004/05', '2005/06', '2006/07', '2007/08',
       '2008/09', '2009/10', '2010/11', '2011/12', '2012/13', '2013/14',
       '2014/15', '2015/16', '2016/17', '2017/18', '2018/19'],
      dtype=object)
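The `Year` values above are financial-year labels, whereas the WDI columns are calendar years, so the two need a common key before merging. A minimal sketch is shown below, assuming each label is mapped to its first calendar year; that mapping is a modelling choice rather than something the data dictates, and `fiscal_year_start` is a hypothetical helper name.

```python
import pandas as pd


def fiscal_year_start(label: str) -> int:
    """Return the first calendar year of a 'YYYY/YY' label."""
    return int(label.split("/")[0])


# A few labels copied from the output above.
years = pd.Series(["2002/03", "2008/09", "2018/19"])
start_years = years.map(fiscal_year_start)
print(start_years.tolist())  # [2002, 2008, 2018]
```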

Given that the NI number data only start in 2002/03, we can drop the earlier years (1960 to 2000) from the WDI dataframe to free up memory and focus the analysis on the years of interest.

In [8]:
# Year columns are stored as strings; range(1960, 2001) covers 1960 to 2000, so 2001 is kept
years_to_delete = [str(year) for year in range(1960, 2001)]
wdi_data = wdi_data.drop(columns=years_to_delete)

print(wdi_data.columns)

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009',
       '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018',
       '2019', 'Unnamed: 64'],
      dtype='object')
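The trailing `Unnamed: 64` column is still present after the drop. Columns like this often appear when each row of a CSV ends with a trailing delimiter, but whether this one is actually empty in `WDIData.csv` has not been checked here. The sketch below, on a small made-up CSV, drops such columns only when they contain no values at all.

```python
import io

import pandas as pd

# Toy CSV whose rows end with a trailing comma, which makes pandas
# create an extra "Unnamed: N" column filled with NaN.
raw = "Country Name,2001,2002,\nA,1.0,2.0,\nB,3.0,4.0,\n"
df = pd.read_csv(io.StringIO(raw))

# Drop only the "Unnamed" columns that are entirely empty.
empty_unnamed = [
    col for col in df.columns
    if col.startswith("Unnamed") and df[col].isna().all()
]
df = df.drop(columns=empty_unnamed)
print(list(df.columns))
```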


Next, I'll look at which of the 1431 indicators best describe the wealth index, and then do further exploratory data analysis before selecting the indicators to use for feature engineering.
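One possible first filter, before judging relevance, is how complete each indicator is over the retained years. The sketch below computes the share of non-missing country-year cells per indicator on a made-up frame in the WDI layout; the indicator names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame in the WDI layout; the values are made up.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "A", "B"],
    "Indicator Name": ["GDP", "GDP", "Rare indicator", "Rare indicator"],
    "2001": [1.0, 2.0, np.nan, np.nan],
    "2002": [1.1, 2.1, 5.0, np.nan],
})
year_cols = [col for col in toy.columns if col.isdigit()]

# Share of non-missing country-year cells for each indicator.
coverage = (
    toy.set_index("Indicator Name")[year_cols]
    .notna()
    .groupby(level="Indicator Name")
    .mean()          # per indicator and year: share of countries with a value
    .mean(axis=1)    # average over the years
    .sort_values(ascending=False)
)
print(coverage)
```

Indicators with very low coverage could then be set aside before any correlation or relevance analysis.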