#### Some Information About New Zealand's Past Citizenship, and This Data
This dataset records how many people from each country of birth were granted New Zealand citizenship between 1949 and 2019.

From 1840 until 1 January 1949 most people in New Zealand were British subjects/citizens. Non-British were ‘aliens’. From 1 January 1949 when New Zealand citizenship was officially established, an ‘alien’ was defined as someone who was not a New Zealand citizen, and not British, British protected, or Irish. People with British nationality who were not New Zealand citizens were not aliens.

In 1977 a review of citizenship and residency removed the term alien from official use. Increasingly the focus has been on citizenship, or residency, or various other more temporary arrangements.

##### Explanation of the columns
* **Country of Birth** -> the country where the person was born
* **Total**            -> total number of people born in that country who were granted New Zealand citizenship
* **%**                -> that country's share of all citizenships granted, as a percentage


1. [Pandas Basics](#1)
1. [Pandas Intermediate](#2)
1. [Plotting in Pandas](#3)
1. [Visualization Tools](#4)
     * [Area Plot](#5)
     * [Histogram](#6)
     * [Bar Charts](#7)
     * [Pie Charts](#8)
     * [Box Plots](#9)
     * [Scatter Plots and how to plot a Linear Line](#10)
     * [Bubble Plots](#11)
     * [Waffle Charts](#12)
     * [Word Clouds](#13)
     * [Regression Plots](#14)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib as mpl
import matplotlib.pyplot as plt # standard python visualization library
%matplotlib inline

import matplotlib.patches as mpatches # needed for waffle Charts

# import library
import seaborn as sns


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

<a id="1"></a> <br>
### Pandas Basics

In [None]:
data = pd.read_csv("../input/new-zealand-citizenships-19492019/granted-citizenship-1949-2019.csv")
data.head() # view the first 5 rows

In [None]:
data.tail() # view the last 5 rows

There is a TOTAL row in the Country of Birth column, which is not a country, so I will exclude it.

In [None]:
data = pd.read_csv("../input/new-zealand-citizenships-19492019/granted-citizenship-1949-2019.csv",
                   skipfooter=1, engine="python") # skip the last row (TOTAL); skipfooter needs the python engine
data.tail() # view the last 5 rows

When analyzing a dataset, it's always a good idea to get basic information about the dataframe.

In [None]:
data.info()

Since the "%" column is an object, I should change it to float, because it should be a numeric value. But the values end with a "%" character, which I need to remove before converting to float.

In [None]:
data["%"] = data["%"].str.rstrip("%") # strip the trailing "%" instead of slicing a fixed width
data.head()

Now we can convert this column to float and rename it, since "%" is not a descriptive column name.

In [None]:
data["%"] = data["%"].astype(float)
data.rename(columns={"%":"percentage_distribution"}, inplace=True)
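
Removing the percent sign can be done either by slicing a fixed number of characters or by stripping the trailing symbol. The following is a minimal sketch on made-up values (the series `raw` is illustrative and not taken from the dataset) that compares the two:

```python
import pandas as pd

# Illustrative values only; the real column may look different.
raw = pd.Series(["0.05%", "12.34%", "7%"])

# Fixed-width slicing keeps the first four characters, whatever they are.
sliced = raw.str[:4]

# Stripping removes only the trailing percent sign, so any width works.
cleaned = raw.str.rstrip("%").astype(float)

print(sliced.tolist())
print(cleaned.tolist())
```

On values like these, slicing can truncate digits or leave the sign in place, so stripping is the safer choice when the width of the values is not known in advance.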

In [None]:
data.columns.values # the column names are all strings, which is what I want

In [None]:
data.index.values

In [None]:
data.shape # size of data frame(rows,columns)

By the way, there are NaN values in the data. What should we put there? Let's sum the yearly values of a row to see whether they equal the Total value. If so, we can simply replace all NaN values with zero.

In [None]:
years = list(map(str,range(1949, 2020))) # column names for the years 1949 to 2019, as a list
# range() yields integers, so I use map() with str to turn them into strings
data.loc[0,years].sum() # sums the yearly values of the row at index 0

I tried this with a few more rows, and the result always matched the Total column. So I will just replace the NaN values with zero.
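
Instead of checking rows one at a time, the comparison can be done for every row at once. This is a minimal sketch on a toy frame (the values and the names `toy` and `year_cols` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Country of Birth": ["A", "B"],
    "1949": [1.0, np.nan],
    "1950": [2.0, 4.0],
    "Total": [3, 4],
})
year_cols = ["1949", "1950"]

# sum() skips NaN by default, so a match with Total suggests NaN stands for zero.
row_sums = toy[year_cols].sum(axis=1)
all_match = bool((row_sums == toy["Total"]).all())
print(all_match)
```

If `all_match` is True for the real data, filling NaN with zero does not change any row total.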

In [None]:
data.isnull().sum().sum() # we have this many NaN values in dataframe

In [None]:
data.fillna(0, inplace=True) # filled with zero
data.head()

In [None]:
data.isnull().sum().sum() # there is no NaN value anymore

In [None]:
# quick review of each column
data.describe() 

<a id="2"></a> <br>
### Pandas Intermediate

**Select Column**

There are two ways to select columns.

* Attribute access works only if the column name has no spaces or special characters.

df.column_name --> returns Series
* Bracket access works for any column name and can also select multiple columns.

df["column"] --> returns Series

df[["column1","column2"]] --> returns DataFrame
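
The return types can be checked directly. A minimal sketch on a toy frame (the frame `toy` is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Total": [10, 20], "2019": [1, 2]})

by_attribute = toy.Total          # attribute access -> Series
by_key = toy["Total"]             # single key -> Series
by_list = toy[["Total", "2019"]]  # list of keys -> DataFrame

print(type(by_attribute).__name__, type(by_key).__name__, type(by_list).__name__)
```

Note that a column named "2019" cannot be reached with attribute access, because `toy.2019` is not valid Python.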

In [None]:
data.Total # filter on Total column

In [None]:
data[["Country of Birth","Total","2019"]] # filter on multiple columns

##### Select Row
There are 2 ways.
* df.loc[label]  --> selects by the labels of the index and columns
* df.iloc[index] --> selects by the integer positions of the rows and columns
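
A minimal sketch of the difference, using a toy frame with made-up numbers (the frame `toy` is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"2018": [5, 7], "2019": [6, 8]}, index=["Russia", "India"])

by_label = toy.loc["India", "2019"]  # row label and column label
by_position = toy.iloc[1, 1]         # second row, second column

print(by_label, by_position)
```

Both expressions reach the same cell; `loc` stays valid if the rows are re-sorted, while `iloc` depends on the current order.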

The default index of the dataset is a numeric range from 0 to 324, which makes it difficult to query by a specific country. It's better to use the Country of Birth column as the index.

In [None]:
data.set_index("Country of Birth",inplace=True) # changing the index of dataframe
data.head(2)

I don't want to see index name here, so let's delete it.

In [None]:
data.index.name = None
data.head(2)

In [None]:
data[years].head(2)

Let's change the float values to integers. They are counts of people, so we don't need a decimal point.

In [None]:
data[years] = data[years].astype(int)
data[years].head(2)

##### Scenario: View the number of citizenships from Russia
* For year 2019
* For years 2000 to 2015

In [None]:
data.loc["Russia","2019"]

In [None]:
years00_15 = list(map(str,range(2000,2016)))
data.loc["Russia",years00_15]

To filter a dataframe based on a condition, we pass the condition as a boolean vector.

In [None]:
condition1 = data["2015"]>500 # boolean Series: True or False for each country
data[condition1] # countries with more than 500 New Zealand citizenships granted in 2015

Let's combine multiple conditions.

In [None]:
condition2 = data["2010"]>1000
data[condition1 & condition2]
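
Conditions are combined element-wise with `&` (and) and `|` (or); when they are written inline, each comparison needs its own parentheses. A minimal sketch on a toy frame with made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame(
    {"2010": [1500, 200, 1200], "2015": [600, 700, 100]},
    index=["A", "B", "C"],
)

# Parentheses are required because & and | bind tighter than > in Python.
both = toy[(toy["2015"] > 500) & (toy["2010"] > 1000)]
either = toy[(toy["2015"] > 500) | (toy["2010"] > 1000)]

print(list(both.index), list(either.index))
```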

<a id="3"></a> <br>
### Plotting in Pandas

##### Line Plot

In [None]:
india = data.loc["India",years] 
india.plot() # as default, it plots like this
plt.show()

In [None]:
mpl.style.use(["ggplot"]) # mpl is matplotlib library
india.plot() 
plt.show()

In [None]:
# we can see which plot styles are available
plt.style.available # you can play around with them

In [None]:
india.index.values # the years are strings; I want to change them to int

In [None]:
india.index = india.index.map(int)
india.plot(kind="line") # its line by default

plt.title("New Zealand Citizenship for Indian people")
plt.xlabel("Years")
plt.ylabel("Number of citizenship")
plt.show()

We can also add text to a plot.

In [None]:
india.index = india.index.map(int)
india.plot(kind="line") 

plt.title("New Zealand Citizenship for Indian people")
plt.xlabel("Years")
plt.ylabel("Number of citizenship")
plt.text(1985,2200, "1990 Year") # x axis, y axis
plt.show()

##### Question: Compare the number of citizenships granted to people from India and England from 1980 to 2010

In [None]:
years80_10 = list(map(str,range(1980,2011)))
data_IE = data.loc[["India","England"],years80_10]
data_IE

data_IE is a dataframe with the countries as the index and the years as the columns. To plot the years on the x-axis, we must first transpose the dataframe to swap the rows and columns.

In [None]:
data_IE = data_IE.transpose()
data_IE.head()

In [None]:
data_IE.index = data_IE.index.map(int) # changing str to int
data_IE.plot(kind="line")
plt.title("Citizenship from India, and England")
plt.xlabel("Years")
plt.ylabel("Number of citizenship")
plt.show()

##### Question: Compare the top 4 countries by total number of New Zealand citizenships granted

In [None]:
# sort the data
data.sort_values(by="Total", ascending=False, axis=0, inplace=True)

# get the top 4 countries
data_top4 = data.head(4)
data_top4
# if i want only Total column, data.head(4).loc[:,"Total"]

In [None]:
# transpose the dataframe by years
data_top4 = data_top4[years].transpose()
data_top4

In [None]:
# plot the dataframe
data_top4.index = data_top4.index.map(int) # change them to int from str
data_top4.plot(kind="line",figsize=(12,6))

plt.title("Top 4 countries that got New Zealand citizenship")
plt.xlabel("Years")
plt.ylabel("Number of Citizenships")
plt.show()

<a id="4"></a> <br>
### VISUALIZATION TOOLS

<a id="5"></a> <br>
#### Area Plot(Stacked Line Plot)
Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values (any NaN values will default to 0). To produce an unstacked plot, pass stacked=False.

In [None]:
data_top4.head()

In [None]:
data_top4.plot(kind="area",
               stacked = False,
               figsize=(17,8))
plt.title("Top 4 countries that got New Zealand citizenship")
plt.xlabel("Years")
plt.ylabel("Number of Citizenships")
plt.show()

The unstacked plot has a default transparency (alpha value) of 0.5. You can change it with the alpha parameter.

##### Two types of plotting
There are two ways to plot with matplotlib: the scripting layer and the artist layer.
* Scripting Layer - uses matplotlib.pyplot as plt. We have been using this method so far.

An example:

data_top4.plot(kind="area",stacked = False, figsize=(17,8))

plt.title("Top 4 countries that got New Zealand citizenship")

plt.xlabel("Years")

plt.ylabel("Number of Citizenships")

plt.show()
* Artist Layer - uses an Axes instance from matplotlib. You can store the Axes instance of your plot in a variable and add more elements through its "set_" methods.

Let's do an example.

In [None]:
ax = data_top4.plot(kind="area",stacked = False, figsize=(17,8))
ax.set_title("Top 4 countries that got New Zealand citizenship")
ax.set_xlabel("Years")
ax.set_ylabel("Number of Citizenships")
plt.show()

<a id="6"></a> <br>
#### Histogram
A histogram is a way of representing the frequency distribution of a numeric dataset.
##### Question: What is the frequency distribution of the number of new citizenships from the various countries in 2019?

In [None]:
data["2019"].head()

In [None]:
count, bin_edges = np.histogram(data["2019"]) # returns 2 values
print(count) # frequency
print(bin_edges) # bin ranges, default=10 bins

In 2019, 312 countries each had between 0 and 468 people who got citizenship,

5 countries each had between 468 and 936 people, and so on.
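
To make the relationship between counts and bin edges concrete, here is a minimal sketch on made-up values (the array `values` is illustrative; two bins are used instead of the default ten to keep it readable):

```python
import numpy as np

values = np.array([0, 5, 10, 20, 100])

# Two equal-width bins over the range [0, 100]: [0, 50) and [50, 100].
count, bin_edges = np.histogram(values, bins=2)

print(count)      # how many values fall in each bin
print(bin_edges)  # n bins are described by n + 1 edges
```

Every bin is half-open on the right except the last one, which also includes its upper edge.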

In [None]:
data["2019"].plot(kind="hist",figsize=(7,4))
plt.title("histogram of 324 countries that got new zealand citizenship in 2019")
plt.xlabel("Number of Citizenships")
plt.ylabel("Number of Countries")
plt.show()

The x-axis tick labels don't match the bin edges. Let's fix that.

In [None]:
count, bin_edges = np.histogram(data["2019"])
data["2019"].plot(kind="hist",figsize=(10,4), xticks=bin_edges)
plt.title("histogram of 324 countries that got new zealand citizenship in 2019")
plt.xlabel("Number of Citizenships")
plt.ylabel("Number of Countries")
plt.show()

##### Question: What's the citizenship distribution for Sweden, Russia, Turkey

In [None]:
data.loc[["Sweden","Russia","Turkey"]]

In [None]:
# transpose the dataframe
data_srt = data.loc[["Sweden","Russia","Turkey"], years].transpose()
data_srt.head()

In [None]:
data_srt.plot(kind="hist",figsize=(7,4))
plt.title("Citizenship from Sweden, Russia, Turkey")
plt.xlabel("Number of Citizenships")
plt.ylabel("Number of Years") # from 1949 to 2019 -> 71 years
plt.show()

Lets improve this plot

In [None]:
count,bin_edges = np.histogram(data_srt,15)

# unstacked histogram
data_srt.plot(kind="hist",
              figsize=(15,5),
              bins=bin_edges,
              alpha=0.5,
              xticks=bin_edges,
              color=['coral', 'darkslateblue', 'mediumseagreen'])

plt.title("Citizenship from Sweden, Russia, Turkey")
plt.xlabel("Number of Citizenships")
plt.ylabel("Number of Years")
plt.show()

We can read the plot like this: for Russia, the number of years in which between 305.1 and 352 people got New Zealand citizenship is somewhere between 0 and 5, and so on.

If we don't want the plots to overlap each other, we can stack them using the stacked parameter. We can also set the min and max of the x-axis to remove the extra gap at the edges of the plot.

In [None]:
count, bin_edges = np.histogram(data_srt, 15)
xmin = bin_edges[0]    
xmax = bin_edges[-1]  

# stacked Histogram
data_srt.plot(kind='hist',
          figsize=(14, 6), 
          bins=15,
          xticks=bin_edges,
          color=['coral', 'darkslateblue', 'mediumseagreen'],
          stacked=True, # it adds up
          xlim=(xmin, xmax)
         )

plt.title("Citizenship from Sweden, Russia, Turkey")
plt.xlabel("Number of Citizenships")
plt.ylabel("Number of Years")
plt.show()

<a id="7"></a> <br>
#### Bar Charts
A bar chart represents a numerical value for each category (or interval) by the length of a bar. To create a bar plot, we can pass one of two arguments via the kind parameter in plot():
* kind=bar creates a vertical bar plot
* kind=barh creates a horizontal bar plot

##### Question: The term "alien" was removed from official use in 1977. The largest group of people in New Zealand was from England, so I want to compare the number of people from England who got New Zealand citizenship each year from 1977 onwards.

In [None]:
# getting the data
years77_19 = list(map(str, range(1977,2020)))
data_england = data.loc["England",years77_19]
data_england.head()

In [None]:
# plotting the data
data_england.plot(kind="bar",figsize=(12,6))
plt.xlabel("Year")
plt.ylabel("Number of Citizenships")
plt.title("New Zealand Citizenship of people from England from 1977 to 2019")
plt.show()


##### Question: Create a horizontal bar plot showing the total number of people that got New Zealand citizenship from the top 10 countries between 1949 and 2019

In [None]:
# sort dataframe on "Total" column
data.sort_values(by="Total", ascending=True, inplace=True)
# get top 10 countries
data_top10 = data["Total"].tail(10)

# plot the data
data_top10.plot(kind="barh", figsize=(10,10), color="steelblue")
plt.xlabel("Number of Citizenships")
plt.title("Top 10 Countries that got New Zealand Citizenship between 1949-2019") # Total covers 1949 to 2019

# annotate value labels to each country
for index, value in enumerate(data_top10):
    label = format(int(value),",") # format int with commas
    
    # place text at the end of the bar(subtracting 14000 from x, and 0.1 from y to make it fit within the bar)
    plt.annotate(label, xy=(value - 14000, index - 0.10), color='white')
    
plt.show()

<a id="8"></a> <br>
#### Pie Charts

In [None]:
# last time we sorted our data in ascending order; let's sort it in descending order again
data.sort_values(by="Total", ascending=False, inplace=True)
data.head()

##### Question: Use a pie chart to show the people that got New Zealand citizenship from the top 5 countries
We will pass in the kind = 'pie' keyword, along with the following additional parameters:
* autopct - is a string or function used to label the wedges with their numeric value.
* startangle - rotates the start of the pie chart by angle degrees counterclockwise from the x axis.
* shadow - draws a shadow beneath the pie to give a 3D feel.

In [None]:
data.head()["Total"].plot(kind="pie",
                   figsize=(5,6),
                   autopct='%1.1f%%',# add in percentages
                   startangle=90,     # start angle 90° (England)
                   shadow=True,       # add shadow      
                  )
plt.title('top 5 countries')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Let's make some improvements. We can add colors, remove the text labels from the pie chart and show them in a separate legend, push the percentages outside the pie, and explode the slices we want to emphasize.

In [None]:
colors_list = ["aliceblue","burlywood","lemonchiffon","skyblue","slategrey"]
explode_list = [0, 0, 0, 0.1, 0.1] # offset ratio for each country; let's explode the lowest 2 countries
data.head()["Total"].plot(kind="pie",
                   figsize=(5,6),
                   autopct='%1.1f%%',    # add in percentages
                   startangle=90,        # start angle 90° (England)
                   shadow=True,          # add shadow      
                   labels=None,          # turn off labels on pie chart
                   pctdistance =1.12,    # the ratio between the center of each pie slice and the start of the text generated by autopct 
                   colors=colors_list,   # add custom colors
                   explode=explode_list  # 'explode' lowest 2 countries
                   )
# scale the title up by 10% to match pctdistance
plt.title("top 5 countries", y=1.1) # you can change "y" value

plt.axis('equal')

# add legend
plt.legend(labels=data.head().index, loc='upper left') 
plt.show()

<a id="9"></a> <br>
#### Box Plots
A box plot is a way of statistically representing the distribution of the data through five main values:

* Minimum: Smallest number in the dataset.
* First quartile: Middle number between the minimum and the median.
* Second quartile (Median): Middle number of the sorted dataset.
* Third quartile: Middle number between median and maximum.
* Maximum: Highest number in the dataset.
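
These five values can be computed directly. The sketch below uses NumPy's percentile function on made-up numbers; note that NumPy interpolates between data points by default, so on other datasets its quartiles can differ slightly from the "middle number" description above:

```python
import numpy as np

values = np.array([1, 3, 5, 7, 9])

# Minimum, first quartile, median, third quartile, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```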

##### Question: Use box plot for people from Turkey that got New Zealand citizenship between 2005-2019

In [None]:
years05_19 = list(map(str,range(2005,2020)))
data.loc[["Turkey"],years05_19]

In [None]:
data_turkey = data.loc[["Turkey"],years05_19].transpose()
data_turkey.head()

In [None]:
data_turkey.plot(kind="box", figsize=(5,5))

plt.title("people from Turkey that got New Zealand citizenship between 2005-2019")
plt.ylabel("Number of Citizenships")
plt.show()

We can make observations based on the plot: there are two outliers, and the median is almost 35.

In [None]:
data_turkey.describe()

##### Question: Compare the distribution of the number of citizenship from Turkey, and Italy between 2005-2019

In [None]:
data_TI = data.loc[["Turkey","Italy"],years05_19].transpose()
data_TI.head()

In [None]:
data_TI.describe()

In [None]:
data_TI.plot(kind='box', figsize=(8, 6))
plt.title('Box plots of New Zealand Citizenships from Turkey and Italy(2005 - 2019)')
plt.ylabel('Number of Citizenships') # in a vertical box plot the values are on the y-axis
plt.show()

Now draw the same data as a horizontal box plot.

In [None]:
data_TI.plot(kind='box', figsize=(8, 6), color="red", vert=False)
plt.title('Box plots of New Zealand Citizenships from Turkey and Italy(2005 - 2019)')
plt.xlabel('Number of Citizenships')
plt.show()

#### Subplots
To visualize multiple plots together, we can create a figure and divide it into subplots, each containing a plot.

Typical syntax is :

fig = plt.figure() # create figure

ax = fig.add_subplot(nrows, ncols, plot_number)

Here, nrows and ncols split the figure into (nrows * ncols) sub-axes, and plot_number selects which one to use.

In [None]:
fig = plt.figure() # create figure
ax0 = fig.add_subplot(1,2,1) # add 1 row, 2 columns, and this is the first plot
ax1 = fig.add_subplot(1,2,2) # this is the second plot

# subplot1-> box plot
data_TI.plot(kind='box', figsize=(17, 6), color="red", vert=False,ax=ax0) # add to subplot1
ax0.set_title('Box plots of New Zealand Citizenships from Turkey and Italy(2005 - 2019)')
ax0.set_xlabel('Number of Citizenships')
ax0.set_ylabel("Country")

# subplot2 -> line plot
data_TI.plot(kind='line', figsize=(20, 6), ax=ax1) # add to subplot 2
ax1.set_title ('Line Plots of New Zealand Citizenships from Turkey and Italy(2005 - 2019)')
ax1.set_ylabel('Number of Citizenships')
ax1.set_xlabel('Year')

plt.show()

##### Question: Create a boxplot to visualize the distribution of number of people from the top 10 countries that got New Zealand citizenship, grouped by the decades 1990s, 2000s, 2010s.

In [None]:
data_top10 = data.sort_values(by="Total",ascending=False,axis=0).head(10) # get the top 10 countries by Total

# create list of years
years_90s = list(map(str,range(1990,2000)))
years_00s = list(map(str,range(2000,2010)))
years_10s = list(map(str,range(2010,2020)))

# get the total for each decade as a series
data_90s = data_top10.loc[:,years_90s].sum(axis=1)
data_00s = data_top10.loc[:,years_00s].sum(axis=1)
data_10s = data_top10.loc[:,years_10s].sum(axis=1)

# merge series to dataframe
new_data = pd.DataFrame({"1990s":data_90s, "2000s":data_00s, "2010s":data_10s})

new_data.head()

In [None]:
new_data.describe()

In [None]:
# Plot the box plots
new_data.plot(kind='box', figsize=(10, 6))
plt.title('Citizenships from top 10 countries for decades 90s, 2000s and 2010s')
plt.show()

To be flagged as an outlier, a data value must be:

* larger than Q3 by at least 1.5 times the interquartile range (IQR), or
* smaller than Q1 by at least 1.5 times the IQR.

Let's look at the 1990s as an example:

* Q1 (25%) = 4932.5
* Q3 (75%) = 14750
* IQR = Q3 - Q1 = 9817.5

Using the definition of an outlier, any value greater than Q3 + 1.5 * IQR will be flagged as an outlier.

Outlier > 14750 + (1.5 * 9817.5) ----> Outlier > 29476.25
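
The same arithmetic can be written as a short sketch. The quartile values are the ones quoted above for the 1990s; the variable names are chosen only for illustration:

```python
# Quartiles for the 1990s, as quoted in the text above.
q1 = 4932.5
q3 = 14750.0

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr  # values below this are flagged as outliers

print(iqr, upper_fence, lower_fence)
```

The lower fence is negative here, so no count of citizenships can fall below it; only high outliers are possible for this decade.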

In [None]:
# let's check how many entries fall above the outlier threshold 
new_data[new_data['1990s'] > 29476.25] # England is the outlier in 1990s 

<a id="10"></a> <br>
#### Scatter Plots and how to plot a Linear Line
Scatter plots look similar to line plots in that they both map independent and dependent variables on a 2D graph. While the datapoints are connected together by a line in a line plot, they are not connected in a scatter plot.

In [None]:
data[years].head()

In [None]:
data[years].sum(axis=0).head()

In [None]:
pd.DataFrame(data[years].sum(axis=0).head())

In [None]:
# we can use the sum() method to get the total number of citizenships per year
data_total = pd.DataFrame(data[years].sum(axis=0))

# change the years to type int (it will be useful for regression later on)
data_total.index = data_total.index.map(int)

# reset the index to put in back in as a column in the data_total dataframe
data_total.reset_index(inplace = True)

# rename columns
data_total.columns = ['year', 'total'] 

data_total.head()

In [None]:
data_total.plot(kind="scatter", x="year", y="total", figsize=(10,6),color="blueviolet")

plt.title('Total New Zealand Citizenships from other countries for 1949 - 2019')
plt.xlabel('Year')
plt.ylabel('Number of Citizenships')

plt.show()

Let's try to plot a linear line of best fit, and use it to estimate the number of citizenships in 2010.

Get the equation of line of best fit. We will use Numpy's polyfit() method:

* x: x-coordinates of the data.
* y: y-coordinates of the data.
* deg: Degree of fitting polynomial. 1 = linear, 2 = quadratic, and so on.

In [None]:
x = data_total['year']      # year on x-axis
y = data_total['total']     # total on y-axis
fit = np.polyfit(x, y, deg=1)

fit

Since we are fitting a linear regression y = a*x + b, the output has 2 elements [4.66030885e+02, -9.11785107e+05], with the slope in position 0 and the intercept in position 1.
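
The order of the coefficients is easy to verify on data whose line is known in advance. A minimal sketch with made-up points that lie exactly on y = 2x + 1:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # points on a known line

# For deg=1, polyfit returns [slope, intercept] (highest power first).
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to estimate y at a new x.
predicted = slope * 4.0 + intercept

print(slope, intercept, predicted)
```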

In [None]:
data_total.plot(kind="scatter",x="year",y="total",figsize=(10,6),color="blueviolet")

plt.title('Total New Zealand Citizenships from other countries for 1949 - 2019')
plt.xlabel('Year')
plt.ylabel('Number of Citizenships')

# plot line of best fit
plt.plot(x, fit[0] * x + fit[1], color='red') # x is the Years, and y = a x + b
plt.annotate('y={0:.0f} x + {1:.0f}'.format(fit[0], fit[1]), xy=(2000, 10000)) # y equation label

plt.show()

# print out the line of best fit
'Number of Citizenships = {0:.0f} * Year + {1:.0f}'.format(fit[0], fit[1]) 

Using the equation of the line of best fit, we can estimate the number of citizenships in 2010:

Number of Citizenships = 466 * Year + -911785

Number of Citizenships = 466 * 2010 + -911785

Number of Citizenships = 24875

The actual value for 2010 (data_total[data_total["year"]==2010]) is 13067, so the line overestimates it by a wide margin. As you can see from the plot, the linear fit is a good estimate around 1960, but not in 2010.
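
As a quick arithmetic check, the rounded equation can be evaluated for any year. This is a minimal sketch that assumes the rounded coefficients 466 and -911785 printed above; `estimate` is a hypothetical helper name:

```python
# Rounded coefficients of the line of best fit, as printed above.
slope = 466
intercept = -911785

def estimate(year):
    """Estimated number of citizenships for a given year (rounded fit)."""
    return slope * year + intercept

est_1990 = estimate(1990)
est_2010 = estimate(2010)

print(est_1990, est_2010)
```

Because the coefficients are rounded, these estimates differ slightly from what the unrounded `fit` array would give.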

<a id="11"></a> <br>
#### Bubble Plots
A bubble plot is a variation of the scatter plot that displays three dimensions of data (x, y, z). The datapoints are replaced with bubbles, and the size of each bubble is determined by the third variable 'z', also known as the weight.

Let's get data for Italy and Spain.

In [None]:
# lets transpose our data to get country list as column
data_t = data[years].transpose()

# cast the Years (the index) to type int
data_t.index = data_t.index.map(int)

# label the index. This will automatically be the column name when we reset the index
data_t.index.name = 'Year'

# reset index to bring the Year in as a column
data_t.reset_index(inplace=True)

data_t.head()

Create the normalized weights.

There are several methods of normalizations in statistics. I will use feature scaling to bring all values into the range [0,1].

In [None]:
# normalize Italy data
norm_italy = (data_t['Italy'] - data_t['Italy'].min()) / (data_t['Italy'].max() - data_t['Italy'].min())

# normalize Spain data
norm_spain = (data_t['Spain'] - data_t['Spain'].min()) / (data_t['Spain'].max() - data_t['Spain'].min())
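
The same feature scaling can be wrapped in a small helper so it is not repeated per country. A minimal sketch, where `min_max_scale` is a hypothetical helper name and the input values are made up:

```python
import pandas as pd

def min_max_scale(series):
    """Scale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())

toy = pd.Series([10, 20, 30])
scaled = min_max_scale(toy)

print(scaled.tolist())
```

A series in which all values are equal would make the denominator zero, so this sketch assumes the data varies.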

Plot the data

* We will pass in the weights using the s parameter. Given that the normalized weights are between 0 and 1, they won't be visible on the plot. Therefore we will multiply the weights by 1000 to scale them up on the graph.

In [None]:
# Italy
ax0 = data_t.plot(kind='scatter',
                    x='Year',
                    y='Italy',
                    figsize=(14, 8),
                    alpha=0.5,                  # transparency
                    color='green',
                    s=norm_italy * 1000,  # pass in weights 
                    xlim=(1949, 2019) # x axis
                   )

# Spain
ax1 = data_t.plot(kind='scatter',
                    x='Year',
                    y='Spain',
                    alpha=0.5,
                    color="blue",
                    s=norm_spain * 1000,
                    ax = ax0
                   )

ax0.set_ylabel('Number of Citizenships')
ax0.set_title('Citizenship of people from Italy and Spain from 1949 - 2019')
ax0.legend(['Italy', 'Spain'], loc='upper left', fontsize='x-large')
plt.show()

The size of the bubble corresponds to the magnitude of number of citizenships for that year. The larger the bubble, the more citizenships in that year.

<a id="12"></a> <br>
#### Waffle Charts

In [None]:
# lets get data for three countries
data_fgs = data.loc[["France","Germany","Singapore"],:]
data_fgs

In [None]:
# compute the proportion of each category with respect to the total
total_values = sum(data_fgs['Total'])
category_proportions = [(value / total_values) for value in data_fgs['Total']]

# print out proportions
for i, proportion in enumerate(category_proportions):
    print (data_fgs.index.values[i] + ': ' + str(proportion))

In [None]:
# defining the overall size of the waffle chart
width = 40 # width of chart
height = 10 # height of chart
total_num_tiles = width * height # total number of tiles

In [None]:
# compute the number of tiles for each category
tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]

# print out number of tiles per category
for i, tiles in enumerate(tiles_per_category):
    print (data_fgs.index.values[i] + ': ' + str(tiles))
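
Because each category is rounded separately, it is worth checking that the tile counts add up to the size of the grid. A minimal sketch with made-up totals (the dictionary `totals` and the 10 x 5 grid are illustrative, not the values used above):

```python
# Made-up category totals and a small grid, for illustration only.
totals = {"A": 50, "B": 30, "C": 20}
width, height = 10, 5
total_num_tiles = width * height

grand_total = sum(totals.values())
tiles = {
    name: round(value / grand_total * total_num_tiles)
    for name, value in totals.items()
}

print(tiles, sum(tiles.values()))
```

With less convenient proportions, independent rounding can leave the sum one tile above or below the grid size, which is a reason to verify it before drawing.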

In [None]:
# initialize the waffle chart as an empty matrix
waffle_chart = np.zeros((height, width))

# define indices to loop through waffle chart
category_index = 0
tile_index = 0

# populate the waffle chart
for col in range(width):
    for row in range(height):
        tile_index += 1

        # if the number of tiles populated for the current category is equal to its corresponding allocated tiles...
        if tile_index > sum(tiles_per_category[0:category_index]):
            # ...proceed to the next category
            category_index += 1       
            
        # set the class value to an integer, which increases with class
        waffle_chart[row, col] = category_index
waffle_chart

As expected, the matrix consists of three categories, and the number of cells in each category matches the number of tiles allocated to it.

In [None]:
# let's check that we did it right
unique, counts = np.unique(waffle_chart, return_counts=True)
dict(zip(unique, counts)) 
# the counts match the allocated tiles: France:81, Germany:186, Singapore:133

In [None]:
# Map the waffle chart matrix into a visual
fig = plt.figure()

# use matshow to display the waffle chart
colormap = plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()
plt.show()

In [None]:
# Let's prettify this

# instantiate a new figure object
fig = plt.figure()

# use matshow to display the waffle chart
colormap = plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()

# get the axis
ax = plt.gca()

# set minor ticks
ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-.5, (height), 1), minor=True)
    
# add gridlines based on minor ticks
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

plt.xticks([])
plt.yticks([])


# compute cumulative sum of individual categories to match color schemes between chart and legend
values_cumsum = np.cumsum(data_fgs['Total'])
total_values = values_cumsum.iloc[-1] # positional access, since the index holds country names

# create legend
legend_handles = []
for i, category in enumerate(data_fgs.index.values):
    label_str = category + ' (' + str(data_fgs['Total'].iloc[i]) + ')'
    color_val = colormap(float(values_cumsum.iloc[i])/total_values)
    legend_handles.append(mpatches.Patch(color=color_val, label=label_str))

# add legend to chart
plt.legend(handles=legend_handles,
           loc='lower center', 
           ncol=len(data_fgs.index.values),
           bbox_to_anchor=(0., -0.2, 0.95, .1)
          )
plt.show()

<a id="13"></a> <br>
#### Word Clouds
The more a specific word appears in a source of textual data, the bigger and bolder it appears in the word cloud.

In [None]:
# import package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS

import urllib.request
response = urllib.request.urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt")
text_file = response.read().decode('utf-8')

In [None]:
# use the stopwords that we imported from wordcloud
stopwords = set(STOPWORDS)

# instantiate a word cloud object
text_wc = WordCloud(
    background_color='white',
    max_words=2500, # show at most 2500 distinct words in the cloud
    stopwords=stopwords
)

# generate the word cloud
text_wc.generate(text_file)

In [None]:
fig = plt.figure()
fig.set_figwidth(8)    # set width
fig.set_figheight(10)  # set height

# display the word cloud
plt.imshow(text_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

In the word cloud, the most common words are Small, Letter, Capital, Sign, Grave, and so on.

Suppose I don't want to see the word "Letter". Then I need to add it as a stopword.

In [None]:
stopwords.add('letter') # add the unwanted word to the stopwords

# re-generate the word cloud
text_wc.generate(text_file)

# display the cloud
fig = plt.figure()
fig.set_figwidth(8) # set width
fig.set_figheight(10) # set height

plt.imshow(text_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Let's generate sample text data from our citizenship dataset, say text data of 70 words.

In [None]:
data.head()

In [None]:
total_citizenship = data["Total"].sum()
total_citizenship

Using countries with single-word names, let's duplicate each country's name based on how much they contribute to the total citizenship.

In [None]:
max_words = 70
word_string = ''
for country in data.index.values:
    # check if country's name is a single-word 
    if len(country.split(' ')) == 1:
        repeat_num_times = int(data.loc[country, 'Total']/float(total_citizenship)*max_words)
        word_string = word_string + ((country + ' ') * repeat_num_times)
                                     
# display the generated text
word_string
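
The same idea on a tiny example makes the effect of the proportions visible. A minimal sketch with made-up totals (the dictionary `totals` is illustrative; the multi-word name is skipped in the same way as above):

```python
# Made-up totals; "South Africa" has two words and is therefore skipped.
totals = {"England": 50, "India": 25, "South Africa": 25}
max_words = 8
grand_total = sum(totals.values())

word_string = ""
for country, total in totals.items():
    # keep only single-word country names
    if len(country.split(" ")) == 1:
        repeat_num_times = int(total / grand_total * max_words)
        word_string += (country + " ") * repeat_num_times

print(word_string)
```

Since `int()` truncates and multi-word names are dropped, the generated text usually contains fewer than `max_words` words.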

In [None]:
# create the word cloud
wordcloud = WordCloud(background_color='white').generate(word_string)

# display the cloud
fig = plt.figure()
fig.set_figwidth(10)
fig.set_figheight(12)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

<a id="14"></a> <br>
#### Regression Plots

In [None]:
data_total.head()

In [None]:
ax = sns.regplot(x='year', y='total', data=data_total)

In [None]:
# changing the color and adding marker
ax = sns.regplot(x='year', y='total', data=data_total,color="green",marker="+")

We can make the figure bigger, increase the size of the markers and tick labels, and change the background.

In [None]:
sns.set(font_scale=1.5)
sns.set_style('ticks') # white background; set the style before plotting so it takes effect

plt.figure(figsize=(12, 7))
ax = sns.regplot(x='year', y='total', data=data_total,color="green",marker="+", scatter_kws={'s': 150})

ax.set(xlabel='Year', ylabel='Total Citizenships') # add x- and y-labels
ax.set_title('Total New Zealand Citizenships from 1949 - 2019') # add title
plt.show()

I'd like to hear your opinion about this kernel, thanks.