This notebook examines the 100 most spoken languages in the world, comparing their native and non native speakers.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
print("Good to go")

Good to go


In [2]:
file_pathway = "../input/100-most-spoken-languages-around-the-world/Top 100 Languages.csv"
data = pd.read_csv(file_pathway, decimal= ',')
data.dtypes

Language            object
Total Speakers       int64
Native Speakers    float64
Origin              object
dtype: object

In [3]:
data.head(5)

Unnamed: 0,Language,Total Speakers,Native Speakers,Origin
0,English,1132366680,379007140.0,Indo-European
1,Mandarin Chinese,1116596640,917868640.0,Sino-Tibetan
2,Hindi,615475540,341208640.0,Indo-European
3,Spanish,534335730,460093030.0,Indo-European
4,French,279821930,77177210.0,Indo-European


For now, the Native Speakers column displays values with a trailing ".0" because its data type is float.

We'll convert it to integer once the missing values (NaNs) are removed.

Additionally, the languages are indexed 0-99 rather than 1-100.
The following code shifts the index so that it runs from 1 to 100.

In [4]:
data.index=np.arange(1,len(data)+1)
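
The two points above (a single NaN forcing a float column, and the shift to a 1-based index) can be sketched on a tiny frame. The rows below are hypothetical placeholders rather than values from the CSV, and `toy` and `clean` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows, not the real CSV): one missing value
# is enough to make pandas store the whole column as float64.
toy = pd.DataFrame({
    "Language": ["A", "B", "C"],
    "Native Speakers": [100.0, np.nan, 300.0],
})

# Shift the default 0-based index to a 1-based one.
toy.index = np.arange(1, len(toy) + 1)

# Casting to int only works once the NaN row is gone.
clean = toy.dropna(subset=["Native Speakers"]).copy()
clean["Native Speakers"] = clean["Native Speakers"].astype("int64")

print(toy.index.tolist())
print(clean.dtypes["Native Speakers"])
```

Note that the remaining rows keep their original labels after the drop, which is why later cells can still refer to languages by rank.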

In [5]:
data.columns=(["Language","Total_Speakers","Native_Speakers","Origin"])
data.Language=data.Language.astype('category')
data.Origin=data.Origin.astype('category')

When grouped by origin, the 42 Indo-European languages in the database account for more than half (55.7%) of the total speakers across all 100 languages.

In [6]:
data['Origin'].value_counts()

Indo-European    42
Afro-Asiatic     15
Niger-Congo      12
Sino-Tibetan      9
Austronesian      9
Turkic            4
Dravidian         4
Kra-Dai           2
Uralic            1
Koreanic          1
Japanic           1
Name: Origin, dtype: int64
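
A share of total speakers per origin, such as the 55.7% quoted above, could be computed with a groupby along the following lines. This is only a sketch: it uses the five rows printed by `data.head(5)`, so its percentages will not match the full-dataset figure, and `sample`, `totals` and `share` are illustrative names.

```python
import pandas as pd

# First five rows of the dataset, as printed by data.head(5).
sample = pd.DataFrame({
    "Language": ["English", "Mandarin Chinese", "Hindi", "Spanish", "French"],
    "Total_Speakers": [1132366680, 1116596640, 615475540, 534335730, 279821930],
    "Origin": ["Indo-European", "Sino-Tibetan", "Indo-European",
               "Indo-European", "Indo-European"],
})

# Sum speakers per origin, then express each sum as a percentage
# of the grand total.
totals = sample.groupby("Origin")["Total_Speakers"].sum()
share = (totals / totals.sum() * 100).round(1).sort_values(ascending=False)
print(share)
```

Running the same three lines on the full `data` frame should give the breakdown that the pie chart below displays.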

In [7]:
fig = px.pie(data, values='Total_Speakers', names='Origin', title="Percentage Breakdown of Language Groups")
fig.show()

In [8]:
fig=px.bar(data, x="Language", y="Total_Speakers", color="Origin", title="Total Speakers Grouped by Origin",
          width=1500, height=800)
fig.update_yaxes(ticks="outside", tickwidth=2, tickcolor='crimson', ticklen=10)
fig.update_layout(
    title="Total Speakers of Languages Grouped By Origin",
    title_font_size=20,
    yaxis_title="Total Speakers",
    legend_title="Language Origin",
    font=dict(
        family="Courier New, monospace",
        color="Black"))
fig.show()

Before looking further into the data, we deal with the missing values.

In [9]:
null_columns=data.columns[data.isnull().any()]
data[null_columns].isnull().sum()

Native_Speakers    4
dtype: int64

In [10]:
is_NaN=data.isnull()
NaN_rows=is_NaN.any(axis=1)
rows_with_NaN=data[NaN_rows]
print(rows_with_NaN)

              Language  Total_Speakers  Native_Speakers         Origin
6      Standard Arabic       273989700              NaN   Afro-Asiatic
36            Filipino        45000000              NaN   Austronesian
50     Nigerian Pidgin        30000000              NaN  Indo-European
96  Cameroonian Pidgin        12000000              NaN  Indo-European


There are four languages in the database that have missing values for the number of Native Speakers. 

For these languages only the number of total speakers is provided.
They are:
- Standard Arabic (274.0 million, Afro-Asiatic)
- Filipino (45 million, Austronesian)
- Nigerian Pidgin (30 million, Indo-European)
- Cameroonian Pidgin (12 million, Indo-European)

In terms of Total Speakers, these are the 6th, 36th, 50th and 96th most spoken languages in the world respectively.

We will drop these rows because, without a Native Speakers figure, they add nothing to a comparison of Native and Non Native speakers.

In [11]:
data.drop([6,36,50,96], inplace=True)
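
As an alternative to listing the row labels by hand, the same rows could be removed with `dropna`, which keeps working if the dataset changes. The sketch below uses three rows taken from the tables in this notebook (ranks 5 to 7); `frame` and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Ranks 5-7, with values as printed elsewhere in this notebook.
frame = pd.DataFrame(
    {
        "Language": ["French", "Standard Arabic", "Bengali"],
        "Total_Speakers": [279821930, 273989700, 265042480],
        "Native_Speakers": [77177210, np.nan, 228289600],
    },
    index=[5, 6, 7],
)

# Drop whichever rows lack a native-speaker count,
# without naming their labels explicitly.
cleaned = frame.dropna(subset=["Native_Speakers"])
print(cleaned.index.tolist())
```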

Once the NaNs are removed, we convert the Native Speakers column to an integer data type and create 2 new columns, Insularity and Non Native. Insularity is the number of Native Speakers divided by Total Speakers for a given language, expressed as a percentage. The higher the value, the more insular the language, since its speakers are mostly native. Conversely, the lower the value, the more the language has spread among non native speakers as a second or third language.

Non Native is simply Total Speakers minus Native Speakers. This column gives the number of speakers who have learned the language, perhaps for work, travel, translation or pleasure, or who have picked it up along the way.
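
The two definitions can be written as a small helper. This is a minimal sketch: `add_reach_columns` is a hypothetical function name, and the two example rows reuse the English and Spanish figures shown in this notebook.

```python
import pandas as pd

def add_reach_columns(df):
    """Return a copy of df with Non_Native and Insularity (percent) columns."""
    out = df.copy()
    # Non Native: total speakers minus native speakers.
    out["Non_Native"] = out["Total_Speakers"] - out["Native_Speakers"]
    # Insularity: native speakers as a percentage of total speakers.
    out["Insularity"] = out["Native_Speakers"] / out["Total_Speakers"] * 100
    return out

example = pd.DataFrame({
    "Language": ["English", "Spanish"],
    "Total_Speakers": [1132366680, 534335730],
    "Native_Speakers": [379007140, 460093030],
})
result = add_reach_columns(example)
print(result[["Language", "Non_Native", "Insularity"]])
```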

We will round Insularity to 4 decimal places rather than the usual 2, because with 2 decimal places a language with only a small number of Non Native speakers is falsely rounded to 100% insularity. Here is a demonstration of this inaccuracy:

In [12]:
data.Native_Speakers=data.Native_Speakers.astype('int')
data['Non_Native']=data['Total_Speakers'] - data['Native_Speakers']
data.Non_Native=data.Non_Native.astype('int')
data['Insularity']=data['Native_Speakers']/data['Total_Speakers']*100
data['Insularity']=data['Insularity'].round(decimals=2)
total_insularity=data.loc[data['Insularity'] == 100]
total_insularity.head(10)

Unnamed: 0,Language,Total_Speakers,Native_Speakers,Origin,Non_Native,Insularity
17,Western Punjabi,92725700,92725700,Indo-European,0,100.0
21,Korean,77264890,77264890,Koreanic,0,100.0
24,Javanese,68277600,68277600,Austronesian,0,100.0
26,Egyptian Spoken Arabic,64618100,64618100,Afro-Asiatic,0,100.0
31,Iranian Persian,52782160,52782160,Indo-European,0,100.0
34,Hakka Chinese,48467490,48467490,Sino-Tibetan,0,100.0
35,Jinyu Chinese,46900000,46900000,Sino-Tibetan,0,100.0
42,Xiang Chinese,37300000,37300000,Sino-Tibetan,0,100.0
46,Eastern Punjabi,32601140,32600670,Indo-European,470,100.0
47,Sunda,32400000,32400000,Austronesian,0,100.0


As you can see, Eastern Punjabi records 470 Non Native speakers in the database, yet its Insularity is falsely rounded to 100%. To give a more accurate reading of Insularity, 4 decimal places are required.
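
The Eastern Punjabi case can be checked by hand with plain Python, using the two figures from the table above. The final comparison is a suggestion rather than the notebook's method: testing the integer difference sidesteps the rounding question entirely.

```python
native = 32600670   # Eastern Punjabi native speakers (from the table above)
total = 32601140    # Eastern Punjabi total speakers

insularity = native / total * 100

# Two decimal places hide the 470 non native speakers ...
print(round(insularity, 2))
# ... while four decimal places keep them visible.
print(round(insularity, 4))

# An integer comparison avoids the rounding question altogether.
fully_insular = (total - native) == 0
print(fully_insular)
```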

In [13]:
data['Insularity']=data['Native_Speakers']/data['Total_Speakers']*100
data['Insularity']=data['Insularity'].round(decimals=4)
total_insularity=data.loc[data['Insularity'] == 100]
len(total_insularity)

36

In [14]:
#Languages with 100% insularity based on data provided
total_insularity

Unnamed: 0,Language,Total_Speakers,Native_Speakers,Origin,Non_Native,Insularity
17,Western Punjabi,92725700,92725700,Indo-European,0,100.0
21,Korean,77264890,77264890,Koreanic,0,100.0
24,Javanese,68277600,68277600,Austronesian,0,100.0
26,Egyptian Spoken Arabic,64618100,64618100,Afro-Asiatic,0,100.0
31,Iranian Persian,52782160,52782160,Indo-European,0,100.0
34,Hakka Chinese,48467490,48467490,Sino-Tibetan,0,100.0
35,Jinyu Chinese,46900000,46900000,Sino-Tibetan,0,100.0
42,Xiang Chinese,37300000,37300000,Sino-Tibetan,0,100.0
47,Sunda,32400000,32400000,Austronesian,0,100.0
49,Sundanese Spoken Arabic,31940300,31940300,Afro-Asiatic,0,100.0


In [15]:
fig = px.bar(total_insularity, x="Language", y="Total_Speakers", color='Origin', width=1000, height=700,title="Total Insularity Languages by Origin")
fig.update_xaxes(tickangle=90, tickfont=dict(family='Courier New', color='black', size=14))
fig.update_yaxes(ticks="outside", tickwidth=2, tickcolor='crimson', ticklen=10)
fig.update_layout(
    title="Completely Insular Languages Grouped By Origin",
    title_font_size=20,
    yaxis_title="Total Speakers",
    legend_title="Language Origin",
    font=dict(
        family="Courier New, monospace",
        color="Black"))
fig.show()

There are 36 languages in the database for which Total Speakers equals Native Speakers, implying zero Non Native speakers. Rather than labelling these languages as unforgivingly insular, the more realistic explanation may simply be incomplete data. For instance, Korean is spoken by a number of non Koreans, as anyone who spends time in Korea can observe.

As an aside, Korean belongs to the Koreanic origin group. Interestingly, however, Korean and Tamil (Dravidian origin) use the same word, or at least the same sound, for "mother" and "father". This could make for a great discussion between Jonathan D. Ripley of the Department of South Asian Studies at Harvard University and researcher Jung Nam Kim, one that may lead to Chola and Song dynasty unravellings. Of course, such talks can only safely occur outside of Ontario, Canada.

Nonetheless, for the purpose of this notebook, we will exclude all languages in the database without Non Native speakers, as they lend no further comparative value.

After dropping these 36 languages, for which Non Native information is absent, in addition to the 4 languages dropped previously for missing data, we are left with 60 of the initial 100 languages in the database to examine.

In [16]:
data.drop(data.loc[data['Non_Native']==0].index, inplace=True)
len(data)

60

In [17]:
most_total=data.sort_values(by='Total_Speakers', ascending=False)
ab=most_total.head(10)
ab

Unnamed: 0,Language,Total_Speakers,Native_Speakers,Origin,Non_Native,Insularity
1,English,1132366680,379007140,Indo-European,753359540,33.4704
2,Mandarin Chinese,1116596640,917868640,Sino-Tibetan,198728000,82.2023
3,Hindi,615475540,341208640,Indo-European,274266900,55.4382
4,Spanish,534335730,460093030,Indo-European,74242700,86.1056
5,French,279821930,77177210,Indo-European,202644720,27.5808
7,Bengali,265042480,228289600,Indo-European,36752880,86.1332
8,Russian,258227760,153746530,Indo-European,104481230,59.5391
9,Portuguese,234168620,220762620,Indo-European,13406000,94.2751
10,Indonesian,198733600,43364600,Austronesian,155369000,21.8205
11,Urdu,170208780,68622980,Indo-European,101585800,40.3169


Looking more closely at the 10 most spoken languages by Total Speakers, a visualization of Native versus Non Native speakers shows how Insularity shapes the makeup of a language's total number of speakers.
Within this top 10 group, four languages stand out for their reach beyond Native Speakers:

- English (33.47% insularity; most spoken language)
- French (27.58% insularity; 5th most spoken language)
- Indonesian (21.82% insularity; 10th most spoken language)
- Urdu (40.32% insularity; 11th most spoken language)

The low insularity of these popular languages means that more than 50% of their speakers (about two-thirds for English) are Non Native!
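
Since Non Native speakers are simply the remainder once Native speakers are taken out, the non native share is 100 minus Insularity. A short check using the four Insularity values from the table above (the dictionary names are illustrative):

```python
# Insularity values (percent) copied from the top-10 table above.
insularity = {
    "English": 33.4704,
    "French": 27.5808,
    "Indonesian": 21.8205,
    "Urdu": 40.3169,
}

# The non native share is the complement of insularity.
non_native_share = {lang: round(100 - pct, 2) for lang, pct in insularity.items()}

for lang, pct in non_native_share.items():
    print(f"{lang}: {pct}% non native")
```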

In [18]:
fig = px.bar(ab, x="Language", y=["Native_Speakers","Non_Native"], title="Breakdown of 10 Most Spoken Languages")
fig.show()

There are 15 languages in the database in which the majority (50%+) of speakers are Non Native.
Put simply, these languages have more Non Native speakers today than Native speakers.
Listed from least to most insular, they are:

- Swahili (16.30% - insert 2pac lyric here)
- Jula (17.68%)
- Indonesian (21.82%)
- French (27.58%)
- Bamanankan (29.09%)
- English (33.47%)
- Northern Sotho (33.73%)
- Thai (34.06%)
- Urdu (40.32%)
- Afrikaans (41.26%)
- Southern Sotho (41.59%)
- Sadri (42.30%)
- Setswana (42.55%)
- Xhosa (42.66%)
- Zulu (43.48%)

In [19]:
least_insular=data.loc[data['Insularity'] < 50]
least_insular.sort_values('Insularity')

Unnamed: 0,Language,Total_Speakers,Native_Speakers,Origin,Non_Native,Insularity
14,Swahili,98327740,16027740,Niger-Congo,82300000,16.3003
93,Jula,12486000,2208000,Niger-Congo,10278000,17.6838
10,Indonesian,198733600,43364600,Austronesian,155369000,21.8205
5,French,279821930,77177210,Indo-European,202644720,27.5808
81,Bamanankan,14102320,4102320,Niger-Congo,10000000,29.0897
1,English,1132366680,379007140,Indo-European,753359540,33.4704
83,Northern Sotho,13731000,4631000,Niger-Congo,9100000,33.7266
28,Thai,60657660,20657660,Kra-Dai,40000000,34.0561
11,Urdu,170208780,68622980,Indo-European,101585800,40.3169
69,Afrikaans,17534580,7234580,Indo-European,10300000,41.2589


In [20]:
fig = px.sunburst(data, path=['Origin','Language'], values='Total_Speakers',
                  color='Non_Native', color_continuous_scale='RdBu', title="Total and Non Native Speakers")
fig.show()

In [21]:
fig = px.scatter(data, x="Total_Speakers", y="Native_Speakers", size="Non_Native", color="Origin", 
                 title="With Bubble Size as Non Native Speakers - Total Speakers vs. Native Speakers",
           hover_name="Language", log_x=True, log_y=True,size_max=75,width=900, height=600)
fig.update_layout(showlegend=True,
    xaxis={'title':'Total Number of Speakers',},
    yaxis={'title':'Total Number of Native Speakers'})
fig.show()

In [22]:
most_insular=data.loc[data['Insularity']>85]
len(most_insular)

31

On the flip side, if we arbitrarily take an Insularity above 85% to mark a language as insular, 31 languages qualify, which is just over half of the 60 (reduced from 100) remaining in the database.
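
Because the 85% cutoff is arbitrary, it may be worth checking how sensitive the count is to that choice. The sketch below assumes a hypothetical helper, `count_above`, and uses a handful of Insularity values copied from the tables in this notebook. On the full `data` frame the counts will of course differ.

```python
import pandas as pd

def count_above(df, threshold):
    """Count rows whose Insularity (percent) exceeds the threshold."""
    return int((df["Insularity"] > threshold).sum())

# A few rows copied from the tables in this notebook.
sample = pd.DataFrame({
    "Language": ["English", "Hindi", "Spanish", "Portuguese", "Tamil"],
    "Insularity": [33.4704, 55.4382, 86.1056, 94.2751, 92.6533],
})

# Compare the number of "insular" languages under different cutoffs.
for cutoff in (80, 85, 90):
    print(cutoff, count_above(sample, cutoff))
```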

Seeing Spanish, Portuguese, Italian, Polish and Japanese classed as insular is particularly surprising given that these languages feature in Rosetta Stone products.
Nonetheless, looking only at the Total Number of Speakers, as is commonly done, does not fully reveal a language's reach.

In [23]:
most_insular.sort_values('Insularity')

Unnamed: 0,Language,Total_Speakers,Native_Speakers,Origin,Non_Native,Insularity
4,Spanish,534335730,460093030,Indo-European,74242700,86.1056
7,Bengali,265042480,228289600,Indo-European,36752880,86.1332
15,Marathi,95312800,83112800,Indo-European,12200000,87.2
97,Sylheti,11800000,10300000,Indo-European,1500000,87.2881
16,Telugu,93040340,82040340,Dravidian,11000000,88.1772
70,Sinhala,17287880,15287880,Indo-European,2000000,88.4312
40,Odia,38051547,34461520,Indo-European,3590027,90.5654
48,Algerian Spoken Arabic,32387600,29387600,Afro-Asiatic,3000000,90.7372
19,Tamil,80989130,75039130,Dravidian,5950000,92.6533
29,Gujarati,60588970,56408970,Indo-European,4180000,93.1011
