# Survey as Text: An Open Approach to Kaggle's Annual Machine Learning and Data Science Survey Competition

## I've built some tools for [Kaggle's annual Machine Learning and Data Science Survey competition](https://www.kaggle.com/c/kaggle-survey-2021).

#### Below I use them to analyze the [Kaggle 2021 response dataset](https://www.kaggle.com/c/kaggle-survey-2021/data).

**First**, I identify the Kaggle data's topic structure at a high level. **Next**, I select and describe one Kaggle user group. **Then** I look at associations within and across the Kaggle data and other datasets, analyze coding-language usage rates across countries and cities, and attempt to suggest life details of the selected user group. **Finally**, I argue that a story has emerged.

![](https://upload.wikimedia.org/wikipedia/commons/2/2a/Funerary_text%2C_Egypt%2C_hieratic_script%2C_Third_Intermediate_Period%2C_1069-664_BC%2C_linen_with_ink_-_Albany_Institute_of_History_and_Art_-_DSC08195.JPG)

*Above: Funerary text, Egypt, hieratic script, Third Intermediate Period, 1069-664 BC, linen with ink - Albany Institute of History and Art [Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/2/2a/Funerary_text%2C_Egypt%2C_hieratic_script%2C_Third_Intermediate_Period%2C_1069-664_BC%2C_linen_with_ink_-_Albany_Institute_of_History_and_Art_-_DSC08195.JPG)*

In [None]:
!pip install ruptures
!pip install country_converter
!pip install resize-and-crop
!pip install image_tools
from IPython.display import Markdown as md
!jupyter nbextension enable --py widgetsnbextension
import ipywidgets as widgets

from IPython.display import display
import warnings
import sys
import spacy
import sklearn
import ruptures as rpt
import re
import random as rn
import random
import pandas as pd
import os
import numpy as np
import nltk
import matplotlib.pyplot as plt
import math
import logging
import image_tools
import glob
import gensim.corpora as corpora
import gensim
import gc
from sklearn.cluster import KMeans
from resize_and_crop import resize_and_crop
from pprint import pprint
from PIL import Image
from nltk.corpus import stopwords
from matplotlib.pyplot import figure
from IPython.display import Image
from image_tools.sizes import resize_and_crop
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from collections import OrderedDict
pd.set_option('display.max_colwidth', None)
nltk.download('stopwords')
stop_words = stopwords.words('english')
%matplotlib inline
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
warnings.filterwarnings("ignore",category=DeprecationWarning)

# 1.  The land

Our initial [**Exploratory data analysis (EDA)**](https://en.wikipedia.org/wiki/Exploratory_data_analysis) of the Kaggle 2021 data will not be traditional. Instead we will make a **topic map** of the most important words and phrases occurring in the raw text of the survey responses and in the document headers.

This is done mainly with a quick implementation of [**Selva Prabhakaran**](https://github.com/selva86)'s 2018 tutorial [**Topic Modelling with Gensim**](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/). Topic modelling is a [**Natural Language Processing (NLP)**](https://www.ibm.com/cloud/learn/natural-language-processing) technique that gives a bird's-eye view of the survey data by extracting **keywords** that seem to make up the **high-level topics** of the survey.

Start by opening the Kaggle 2021 survey data, the 'responses.csv' document, in its raw text format.

In [None]:
warnings.filterwarnings('ignore')
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
c = len(df.columns)  # number of columns in the raw file
n = len(df)          # number of rows in the raw file
x = 2500             # sample size used for the topic map

df = df.fillna(0)
df = df.sample(n=x)

# Join all columns of each row into one long text string.
columns = list(df.columns)
df = df.astype(str)
df = df + "   "
df['period'] = df[columns].sum(axis=1)
dfz = df['period'][:1]  # keep one row for display
df = df['period'].to_list()

# Note: str.strip(chars) removes any of the listed characters from both ends,
# not the literal word, so ' None ' trims stray ' ', 'N', 'o', 'n', 'e' characters.
def trim(strings):
    for chars in (' None ', ' Never ', ' Other ', '$', '*'):
        strings = [s.strip(chars) for s in strings]
    return strings

df = trim(df)
dfz = trim(dfz)

# Show the single row in small type.
md('<sub><sup>' + str(dfz) + '</sup></sub>')

The output above is one row of the spreadsheet, the complete response-set for one respondent. The way we are working with this spreadsheet, each row is just a long string of raw text like this.
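The joining step can be sketched on a toy table. This is a minimal illustration, not the notebook's exact code: the column names `Q1` to `Q3` and the answers are made up, and `agg` with a separator stands in for the string-summing approach used in the cell above.

```python
import pandas as pd

# Toy stand-in for the survey file; the column names and answers are made up.
toy = pd.DataFrame({
    "Q1": ["25-29", "30-34"],
    "Q2": ["Man", "Woman"],
    "Q3": ["India", None],
})

# Replace missing answers with 0, cast every cell to text,
# then join the cells of each row with a three-space separator.
row_text = toy.fillna(0).astype(str).agg("   ".join, axis=1).tolist()

print(row_text)
```

Each element of `row_text` is one respondent's complete answer set as a single string, which is the shape the topic model expects.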

In [None]:

md(f"The *'responses.csv'* spreadsheet contains **{n}** observations across **{c}** columns, but we have joined all the columns into one, so we have **{n}** observations across only **one** column.")


We will use a random sample of **2500** observations for this first analysis task: making the topic map.

## Topic Map

To build the topic map we break up our text strings into words and word-groupings, assigning numeric values to each based on frequency of occurrence. 
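As a rough illustration of turning words into numbers, here is a hand-rolled bag-of-words sketch in plain Python. It is only an analogue of what the notebook does with gensim's `Dictionary` and `doc2bow`; the two example strings are invented and the word ids it assigns will not match gensim's.

```python
from collections import Counter

# Two made-up response strings standing in for rows of the survey text.
docs = [
    "python sql python jupyter",
    "r sql rstudio",
]

# Tokenize by whitespace and give each distinct word an integer id,
# in order of first appearance.
tokens = [doc.split() for doc in docs]
vocab = {}
for words in tokens:
    for word in words:
        vocab.setdefault(word, len(vocab))

# Represent each document as sorted (word id, count) pairs.
bow = [sorted(Counter(vocab[w] for w in words).items()) for words in tokens]

print(vocab)
print(bow)
```

The list of `(id, count)` pairs per document is the numeric form that a topic model consumes in place of the raw text.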

With our words turned into numbers and a few [distribution assumptions](https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd), we assign topics to each word and *rearrange* (1) *the topics within each string*, and (2) *the words under each topic* “to obtain a good composition of topic-keywords distribution” (Prabhakaran, 2018) for our text collection (which, remember, consists of the response sets of the respondents in the 'responses.csv' document).

<span style="background-color: #F9F5AC">See below for the readout of keywords in our topic and their "weights", or relevance to the topic.</span>  

In [None]:

data = df
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub(r'\s+', ' ', sent) for sent in data]
data = [re.sub(r"\'", "", sent) for sent in data]
data = [re.sub("None", "", sent) for sent in data]
data = [re.sub("none", "", sent) for sent in data]
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=90000)
trigram = gensim.models.Phrases(bigram[data_words], threshold=90000)  
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['PROPN', 'NOUN', 'PART', 'X', 'CCONJ', 'ADV', 'VERB']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load("en_core_web_sm")
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['PROPN', 'NOUN', 'PART','X', 'CCONJ', 'ADV', 'VERB'])
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=2, 
                                           random_state=1,
                                           update_every=1,
                                           chunksize=1,
                                           passes=1,
                                           alpha='auto',
                                           per_word_topics=False)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
swt=str(lda_model.print_topics())
md(str((swt)))

**Storytelling Tools in the Toolbar** 

EDA data visualization tools are included in all spreadsheet applications. Who can live without quick bar charts, pies, lines, and scatterplots? <span style="background-color: #F9F5AC"> Think of this next step as putting a new button up in your toolbar called <span style="color:blue">TOPIC MAP</span>.</span>

The first part of our "Topic Map" pipeline was the preceding analysis. <span style="background-color: #F9F5AC">Next we visualize the weights and map them: we chart topic keywords in a labelled scatterplot where y-axis values are the topic weights and x-axis values are the relative change between consecutive weights.</span>

**Colors and K Means** 

The finishing touch of the topic map is clustering the keyword labels by performing a K-Means cluster analysis on the ranked values of x and y, assigning each value to one of five clusters. This cluster assignment just determines the color of the marker on the scatter plot and *is only meant to provide an intuitive sense of groupings in the chart data.*

<span style="background-color: #E4DDF3">Throughout this notebook we run clustering analyses and assign variables to k-means clusters in order to color the markers on our scatterplots, without much further explanation.</span> <span style="background-color: #F9F5AC">We are effectively treating this as another independent, new option in the EDA toolbar for scatterplots. It might be called <span style="color:red">CLUSTER COLOR</span>.</span>

Note: more-detailed usage and explanation of K-Means clustering is given in a later section.
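The two toolbar ideas can be sketched together on made-up numbers. This is a simplified illustration under stated assumptions: the keywords and weights are invented, the change between weights is a plain difference (the notebook's cell uses a percent change on slightly jittered weights), and three clusters are used instead of five because the toy table has only six rows.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up keyword weights standing in for the topic model output.
toy = pd.DataFrame({
    "keyword": ["python", "sql", "model", "cloud", "data", "learn"],
    "weight": [0.050, 0.041, 0.030, 0.022, 0.015, 0.009],
})

# x: change from one weight to the next (first row back-filled); y: the weight itself.
toy["change"] = toy["weight"].diff().bfill()
toy["x"] = toy["change"].rank()
toy["y"] = toy["weight"].rank()

# Cluster the ranked coordinates; the label is used only to pick a marker color.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy[["x", "y"]].to_numpy())
palette = {0: "SaddleBrown", 1: "Indigo", 2: "DarkOliveGreen"}
toy["color"] = pd.Series(km.labels_, index=toy.index).map(palette)

print(toy[["keyword", "x", "y", "color"]])
```

Passing `toy["color"]` as the `c` argument of a scatter plot then colors each marker by its cluster, which is all the "cluster color" option is meant to do.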

In [None]:
%%capture
flip=re.findall(r'\b\d+\b',swt)
sp=pd.DataFrame(flip)
sp=sp.astype('int')
sp
sp= sp[sp[0] !=0]
spx=sp
spx
qwpe=pd.DataFrame(spx[0])
#qwpe=qwpe[-10:]
#qwpe=qwpe.reset_index()
qwpe
#sp=sp[-10:]
#str(quoted.findall(swt))[1]

import re
quoted = re.compile('"[^"]*"')
for i in range(10):
  sp.iloc[i]=str(quoted.findall(swt)[i])
  #if i is 1:
sp    

sp=sp[:10]
sp=sp.reset_index()
spx=spx.reset_index()
qwpe=qwpe.reset_index()
sp['ti']=qwpe[0]


sp['kfx']=qwpe[0]

import random as rn
edc=int(rn.randrange(1,9))


sp[0]
sp['kfi']=sp[0]

sp = sp[sp.columns[sp.columns.isin([0,'kfx'])]] 
sp[0]=pd.DataFrame(sp[0]).applymap(lambda x: x.replace('"', ''))
sp=sp.replace('machine', 'machine learning')
sp


#sp[0]
sp['kfx']=sp['kfx'].astype('int')*.001

float(sp['kfx'].loc[i]+(float(rn.randrange(1,9)))*.9)
sp['kx']=sp['kfx']
for i in range (int(len(sp['kfx']))):
  #sp['kx']=sp['kfx']
  sp['kx'].loc[i]=float(sp['kfx'].loc[i]+(float(rn.randrange(1,9)))*.001)
sp['cha']=sp['kx'].pct_change() * 1


sp=sp.bfill()
sp[:1]['cha']=float(sp[:1]['cha'])*.1+sp[:1]['cha']

sp['x'] = sp['cha'].rank()
sp['y'] = sp['kx'].rank()
val=sp[0]

# Convert to numpy
xs = sp[ 'x'].to_numpy()#np.random.randint( 0, 10, size=10)
ys = sp['y'].to_numpy()#np.random.randint(-5, 5,  size=10)
val=sp[0].to_numpy()
sp['topic']=sp[0]
sp['weight']=sp['kfx']
spii=sp[sp.columns[sp.columns.isin(['topic','weight'])]] 
spii.style.bar().hide(axis="index")  # Styler.hide_index() was removed in pandas 2.0; hide() needs pandas >= 1.4
boob = sp[sp.columns[sp.columns.isin(['x','y'])]] 
boob

mat = boob.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
labels = km.labels_
# Format results as a DataFrame
results = pd.DataFrame([boob.index,labels]).T
#erb=results[1].to_numpy()
sp['cluser']=results[1]

s=pd.DataFrame([i**2*2+2 for i in list(sp['cha'])])
s=s*-10
s=s.rank()
s=s*10

from matplotlib.font_manager import FontProperties
def maro():
  """Draw the topic map: ranked keyword positions, colored by k-means cluster."""
  plt.rcParams['font.family'] = 'serif'
  plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']
  fig, ax = plt.subplots()
  ax.axis("off")
  colors={0:'SaddleBrown', 1:'Indigo', 2:'DarkOliveGreen', 3:'DarkGoldenrod', 4: 'FireBrick'}

  plt.scatter(x=xs, y=ys, s=40, c=sp['cluser'].map(colors))
  plt.title('Topic Keywords: 2021 Kaggle Survey Responses', y=1.12, fontsize=18)

  # Label each marker with its keyword.
  for x, y, label in zip(xs, ys, val):
    plt.annotate(label,                       # the keyword text
                 (x, y),                      # the point being labelled
                 textcoords="offset points",  # interpret xytext as an offset in points
                 size=12,
                 color='grey',
                 xytext=(-1, 10),             # offset of the label from the point
                 ha='center')                 # horizontal alignment

#### Okay, we can now generate our topic map. 

In [None]:
maro()

#### Look what we have here: if someone handed you, the reader, a copy of the Kaggle survey response dataset out of the blue and asked "what's the survey all about?", you could probably infer something about the lay of the land.

Now, of course, we see (or rather don't see) that **there is nothing personal showing up in these initial results**. We can't tell a human story with just this, but it was only meant to be a high-level look. Now we go in deep.

## Look at another raw text-string:

In [None]:
warnings.filterwarnings('ignore')
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
c = len(df.columns)  # number of columns in the raw file
n = len(df)          # number of rows in the raw file
x = 2500             # sample size

df = df.fillna(0)
df = df.sample(n=x)

# Join all columns of each row into one long text string.
columns = list(df.columns)
df = df.astype(str)
df = df + "   "
df['period'] = df[columns].sum(axis=1)
dfz = df['period'][:1]  # keep one row for display
df = df['period'].to_list()

# Note: str.strip(chars) removes any of the listed characters from both ends,
# not the literal word, so ' None ' trims stray ' ', 'N', 'o', 'n', 'e' characters.
def trim(strings):
    for chars in (' None ', ' Never ', ' Other ', '$', '*'):
        strings = [s.strip(chars) for s in strings]
    return strings

df = trim(df)
dfz = trim(dfz)

md(str(dfz))

## <span style="background-color: #F9F5AC"> There are some personal labels there, obviously, but the majority of the data points we're working with are technical terms and names of corporate entities, products, and platforms (not to mention a ton of zeros).</span>

## <span style="background-color: #F9F5AC"> To <span style="color:blue">break into this data and find the humanity</span>, we will "explode" the dataset with a special encoding technique, thereby making human traits easier for our machine to see.</span>

# 2. All data can explode

### One-hot Encoding

One-hot encoding is a data pre-processing technique that can help machines see things in the data they might otherwise miss. With one-hot encoding, each distinct value in a column...

> ... is converted into a new column and assigned a 1 or 0 (notation for true/false) value to the column. - https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

This creates a far larger and more complex dataset.
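A minimal sketch of the encoding on a toy table, using the same `pd.get_dummies` call as the notebook. The column names and answers here are made up for illustration.

```python
import pandas as pd

# Made-up answers; the real survey has many more columns and categories.
toy = pd.DataFrame({
    "gender": ["Man", "Woman", "Man"],
    "language": ["Python", "R", "Python"],
})

# Each distinct value becomes its own indicator column named "<column>_<value>".
encoded = pd.get_dummies(toy).astype(int)

print(encoded.columns.tolist())
# ['gender_Man', 'gender_Woman', 'language_Python', 'language_R']
print(encoded)
```

Two columns with two distinct values each become four indicator columns, which is why the survey table grows so much wider after encoding.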

In [None]:
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
dfw=df.fillna(0)
new_header = dfw.iloc[0] 
dfw = dfw[1:] 
dfw.columns = new_header
dfcow=dfw['Duration (in seconds)']
dfw=dfw.drop(['Duration (in seconds)'], axis=1)
dfw=dfw.astype('str')
dfw=pd.get_dummies(dfw)
dfw.astype('float')
md(f"The one-hot-encoded dataframe is big indeed. **{str(int(len(dfw.columns)-int(len(df.columns))))}** columns were added for total of **{str(int(len(dfw.columns)))}** columns across the **{int(len(dfw))}** observations. ")

After the encoding, I visualize the data as a map of 1's and 0's. Below are the first 500 rows of the dataset as a 500 x 968 grid. The full dataset is **25973** rows by **968** columns after one-hot encoding.

In [None]:
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
dfw=df.fillna(0)
new_header = dfw.iloc[0] 
dfw = dfw[1:] 
dfw.columns = new_header
dfcow=dfw['Duration (in seconds)']
dfw=dfw.drop(['Duration (in seconds)'], axis=1)
dfw=dfw.astype('str')
dfw=pd.get_dummies(dfw)
dfw.astype('float')



zsl=dfw[:500]
zs=zsl.values#dfw.values

# IMAGE

plt.rcParams["figure.figsize"] = (20,7)
# Draw the 0/1 grid: one pixel per cell, no smoothing.
plt.imshow(zs, cmap='bone', interpolation='nearest')
plt.axis('off')
plt.rcParams["figure.figsize"] = (20,20)

#### I find visualizations of one-hot encoded data beautiful; they remind me of DNA staining.

![](https://upload.wikimedia.org/wikipedia/commons/7/76/RAPD-profiles-of-genomic-DNA-isolated-from-the-leaves-of-Mentha-arvensis-seedlings-after-12-days-of-Hg-treatment-along-w.jpg)

## k-means again

With the one-hot encoded data in hand, we assign each of the **968** columns to one of five clusters with another k-means analysis. This is a bigger operation than the <span style="color:red">CLUSTER COLOR</span> button we built earlier using k-means and warrants more discussion on clustering analysis generally and k-means specifically: 

**Generally**...

> ...**clustering** aims at forming subsets or groups within a dataset consisting of data points which are really similar to each other and the groups or subsets or clusters formed can be significantly differentiated from each other.

**Specifically**...

> **K-Means** Clustering is an Unsupervised Learning algorithm, used to group the unlabeled dataset into different clusters...


The K-Means clustering algorithm detects similarity and difference in the structure of the dataset; we then group the variables (our **968** columns) according to that structure. The diagram below from Wikipedia explains the general idea of clustering better than words.

Now imagine the simple diagram below has **968** dots split among five clusters instead of three. That's roughly what we're about to do now.


<a href="url"><img src="https://upload.wikimedia.org/wikipedia/commons/8/88/FactorAnalysis_ConceptualModel_DotsRings.png" align="center" height="280" width="280" ></a>

Performing even a large k-means analysis in Python is simple in practice and takes only a few lines of code. The fundamental operation used next is downright laconic:

`kmeans = KMeans(n_clusters=5).fit(df)`

Let us execute. 

Now see below: the analysis produces a dataframe with five rows, one for each cluster in the model, and **968** columns (the table shows the last five).

* The rows contain the "loadings" of each column for the cluster: how strongly the column is "in" each cluster. Each value is a coordinate of that cluster's centroid, which for a 0/1 column is the share of the cluster's respondents who gave that answer.

* Each column in the response data is "loaded" on every cluster, with a value between 0 and 1.

#### **Columns** are **assigned** to clusters according to their **maximum values**, shown shaded in grey. 
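To make the table of "loadings" concrete, here is a small sketch on invented one-hot rows. It assumes two clusters and three made-up columns; `n_init` and `random_state` are set only to make the toy run repeatable and are not taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up one-hot rows: two respondents answer "Python", two answer "R".
onehot = pd.DataFrame({
    "lang_Python":  [1, 1, 0, 0],
    "lang_R":       [0, 0, 1, 1],
    "role_Student": [1, 0, 1, 0],
})

# Fit k-means on the rows; each centroid holds one value per column.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(onehot)

# One row per cluster, one column per survey label, as in the table below.
centroids = pd.DataFrame(km.cluster_centers_, columns=onehot.columns)

print(centroids)
```

In this toy case one cluster has a `lang_Python` value of 1 and the other a value of 0, while `role_Student` sits at 0.5 in both, so the language columns separate the clusters and the student column does not.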

In [None]:
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
df=df.fillna(0)
new_header = df.iloc[0] 
df = df[1:] 
df.columns = new_header
dfco=df['Duration (in seconds)']
df=df.drop(['Duration (in seconds)'], axis=1)
df=df.astype('str')
df=pd.get_dummies(df)
df.astype('str')
dat1=df
dfx=pd.DataFrame(dat1.columns.to_list())
dfx=dfx.T
kmeans = KMeans(n_clusters=5).fit(df)
centroids = kmeans.cluster_centers_
pd.set_option('display.precision',10)
dfg=pd.DataFrame(centroids)
dfg=dfg.T
#dfg=dfg.rank()
dfg=dfg.T
dfg
yesp=dfg.T
yesp['-']=dfx.T
yesp.index=yesp['-']
yesp=yesp.drop(['-'], axis=1).T
yesp
yesper=yesp
index=['cluster 1', 'cluster 2', 'cluster 3', 'cluster 4', 'cluster 5']
yesper.index=index
sdea=yesper.iloc[: , -5:]
sdea.style.highlight_max(color = 'lightgrey', axis = 0)

#### The five-row sample below transposes the rows and columns of the table above. 

It is meant to show the structure of the data: labels (the original column names) on the left, followed by a column of cluster assignments and a column of weights.

In [None]:
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
df=df.fillna(0)
new_header = df.iloc[0] 
df = df[1:] 
df.columns = new_header
dfco=df['Duration (in seconds)']
df=df.drop(['Duration (in seconds)'], axis=1)
df=df.astype('str')
df=pd.get_dummies(df)
df.astype('str')
dat1=df
dfx=pd.DataFrame(dat1.columns.to_list())
dfx=dfx.T
kmeans = KMeans(n_clusters=5).fit(df)
centroids = kmeans.cluster_centers_
pd.set_option('display.precision',10)
dfg=pd.DataFrame(centroids)
dfg=dfg.T
#dfg=dfg.rank()
dfg=dfg.T
dfg
yesp=dfg.T
yesp['-']=dfx.T
yesp.index=yesp['-']
yesp=yesp.drop(['-'], axis=1).T
yesp
yesper=yesp
index=['cluster i', 'cluster ii', 'cluster iii', 'cluster iv', 'cluster v']
yesper.index=index
yesper.style.highlight_max(color = 'lightgrey', axis = 0)

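# Row 1: the cluster with the largest centroid value for each column.
dfg1 = pd.DataFrame(dfg.idxmax()).T
# Row 2: that largest centroid value (the "model weight").
dfg2 = pd.DataFrame(dfg.max()).T
# Row 3: the original column labels.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
dfg3 = pd.concat([dfg1, dfg2, dfx]).T

dfg3.columns = ['Cluster Assigned', 'Model Weights', 'Survey Q & A']
dfg3['Cluster Assigned'] = dfg3['Cluster Assigned'].astype(int)

# Move the label column to the front and show the last five rows.
dfg4 = dfg3
cols = list(dfg4.columns)
cols = [cols[-1]] + cols[:-1]
dfg4 = dfg4[cols]
dfgyu = dfg4[-5:]

dfgyu.style.hide(axis="index")  # needs pandas >= 1.4; older versions used hide_index()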

#### Every label in the 'Survey Q & A' column is associated with the cluster where it loaded most strongly, that is, where it showed the highest model weight.
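The assignment rule can be sketched on a hand-written centroid table. The numbers and labels below are invented; only the three output column names are taken from the notebook.

```python
import pandas as pd

# Hand-written centroid table: one row per cluster, one column per survey label.
centroids = pd.DataFrame(
    [[0.9, 0.1, 0.4],
     [0.2, 0.8, 0.6]],
    columns=["lang_Python", "lang_R", "role_Student"],
)

# One row per label: the cluster with the largest value, and that value.
table = pd.concat(
    [centroids.idxmax().rename("Cluster Assigned"),
     centroids.max().rename("Model Weights")],
    axis=1,
)

# Put the labels in a leading column and drop them from the index.
table.insert(0, "Survey Q & A", table.index)
table = table.reset_index(drop=True)

print(table)
```

Here `lang_Python` lands in cluster 0 with weight 0.9, while `lang_R` and `role_Student` land in cluster 1 with weights 0.8 and 0.6.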

### Our columns have been **clustered**, and we've flipped the table axes so column names are now the row labels. 

Now to charting the Model Weights for each cluster:   

In [None]:
cvrb = 4  # zero-based index of the cluster to chart
is_2002 = dfg4['Cluster Assigned'] == cvrb
gapminder_2002 = dfg4[is_2002].reset_index()

y = np.array(gapminder_2002['Model Weights'].to_list())

# Find n_breaks-1 change points in the weight sequence with ruptures.
n_breaks = 5
model = rpt.Dynp(model="l2")
model.fit(y)
breaks = model.predict(n_bkps=n_breaks-1)
breaks_rpt = [gapminder_2002.index[i-1] for i in breaks]

plt.rcParams["figure.figsize"] = (8,6)
fig, ax = plt.subplots()
fig.patch.set_visible(True)
ax.axis('off')
ax.spines['bottom'].set_visible(True)
plt.plot(y, color='k', label='unsorted weights')
plt.title('Unsorted Cluster Weights | Cluster ' + str(cvrb+1), y=1.12, fontsize=18)

print_legend = True

In [None]:
def exy():
  """Write each cluster's rows and weights to disk: gaps{k}.csv, x{k}.txt, y{k}.txt."""
  woman = 'What is your gender? - Selected Choice_Woman'
  for k in range(5):
    # Rows assigned to cluster k, in their original (unsorted) order.
    cluster = dfg4[dfg4['Cluster Assigned'] == k].reset_index()
    cluster.to_csv(f'gaps{k+1}.csv')
    y = np.array(cluster['Model Weights'].to_list())
    x = np.arange(0, len(y), 1)
    np.savetxt(f'y{k+1}.txt', y)
    np.savetxt(f'x{k+1}.txt', x)
    # Record the row position of the 'Woman' label if it falls in this cluster.
    # np.savetxt needs at least a 1-D array, hence np.atleast_1d.
    hits = cluster.index[cluster['Survey Q & A'] == woman].tolist()
    if hits:
      np.savetxt('fee.txt', np.atleast_1d(hits[0]))
exy()

x0 = np.loadtxt('x1.txt', dtype=float)
x1 = np.loadtxt('x2.txt', dtype=float)
x2 = np.loadtxt('x3.txt', dtype=float)
x3 = np.loadtxt('x4.txt', dtype=float)
x4 = np.loadtxt('x5.txt', dtype=float)
y0= np.loadtxt('y1.txt', dtype=float)
y1 = np.loadtxt('y2.txt', dtype=float)
y2 = np.loadtxt('y3.txt', dtype=float)
y3 = np.loadtxt('y4.txt', dtype=float)
y4 = np.loadtxt('y5.txt', dtype=float)
print_legend=True
y = np.array(gapminder_2002['Model Weights'].to_list())
#y
n_breaks=5
model = rpt.Dynp(model="l2")

    
def ben():
  """Plot the weight curve of each cluster side by side and save its change points."""
  f, axes = plt.subplots(1, 5, figsize=(17, 2.4))
  names = ['One', 'Two', 'Three', 'Four', 'Five']
  series = [(x0, y0), (x1, y1), (x2, y2), (x3, y3), (x4, y4)]
  for k, (ax, (xk, yk)) in enumerate(zip(axes, series)):
    # Detect n_breaks-1 change points in this cluster's weight curve.
    model.fit(yk)
    breaks = model.predict(n_bkps=n_breaks-1)
    breaks_rpt = [xk[i-1] for i in breaks]
    pd.DataFrame(breaks_rpt).to_csv(f'blecks{k}.csv')
    ax.plot(xk, yk, c='k', linestyle='dashed')
    # The break lines are drawn in white, so they do not show on a white background.
    for b in breaks_rpt:
      ax.axvline(b, color='white')
    ax.axis('off')
    ax.set_title('Cluster ' + names[k], y=1.1, fontsize=14)
ben()

### Now sort each of them. 

In [None]:
def exy():
  """Write each cluster's rows, sorted by weight, to disk: gaps{k}.csv, x{k}.txt, y{k}.txt."""
  woman = 'What is your gender? - Selected Choice_Woman'
  for k in range(5):
    # Rows assigned to cluster k, sorted by ascending model weight.
    cluster = dfg4[dfg4['Cluster Assigned'] == k]
    cluster = cluster.sort_values(by=['Model Weights']).reset_index()
    cluster.to_csv(f'gaps{k+1}.csv')
    y = np.array(cluster['Model Weights'].to_list())
    x = np.arange(0, len(y), 1)
    np.savetxt(f'y{k+1}.txt', y)
    np.savetxt(f'x{k+1}.txt', x)
    # Record the row position of the 'Woman' label if it falls in this cluster.
    # np.savetxt needs at least a 1-D array, hence np.atleast_1d.
    hits = cluster.index[cluster['Survey Q & A'] == woman].tolist()
    if hits:
      np.savetxt('fee.txt', np.atleast_1d(hits[0]))
exy()

x0 = np.loadtxt('x1.txt', dtype=float)
x1 = np.loadtxt('x2.txt', dtype=float)
x2 = np.loadtxt('x3.txt', dtype=float)
x3 = np.loadtxt('x4.txt', dtype=float)
x4 = np.loadtxt('x5.txt', dtype=float)
y0= np.loadtxt('y1.txt', dtype=float)
y1 = np.loadtxt('y2.txt', dtype=float)
y2 = np.loadtxt('y3.txt', dtype=float)
y3 = np.loadtxt('y4.txt', dtype=float)
y4 = np.loadtxt('y5.txt', dtype=float)
print_legend=True

def ben():
  # One panel per cluster: the sorted weights as a dashed line, with a
  # vertical rule at each detected breakpoint.
  f, axes = plt.subplots(1,5,figsize=(17,2.4))
  xs_all = [x0, x1, x2, x3, x4]
  ys_all = [y0, y1, y2, y3, y4]
  names = ['One', 'Two', 'Three', 'Four', 'Five']
  for k, (ax, xk, yk, name) in enumerate(zip(axes, xs_all, ys_all, names)):
    model.fit(yk)
    breaks = model.predict(n_bkps=n_breaks-1)
    # each breakpoint is a segment's end position, so i-1 is its last item
    breaks_rpt = [xk[i-1] for i in breaks]
    pd.DataFrame(breaks_rpt).to_csv('blecks' + str(k) + '.csv')
    ax.plot(xk, yk, c='k', linestyle='dashed')
    for b in breaks_rpt:
      # white rules blend into a white background, so this chart shows
      # only the sorted weights
      ax.axvline(b, color='white')
    ax.axis('off')
    ax.set_title('Cluster ' + name, y=1.1, fontsize=14)
ben()

**Slice up** the list of sorted weights using [**breakpoint detection**](https://towardsdatascience.com/getting-started-with-breakpoints-analysis-in-python-124471708d38), a technique sometimes used in time-series analyses. 
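To make the idea concrete, here is a minimal pure-Python sketch of breakpoint detection on a sorted list. It is a stand-in, not the notebook's `model`: it assumes a squared-error cost and an exhaustive dynamic-programming search, whereas the notebook relies on whatever search method and cost its `ruptures` model was configured with. The function name `segment_l2` and the sample weights are hypothetical.

```python
def segment_l2(values, n_segments):
    """Split `values` into `n_segments` contiguous runs that minimise the
    total squared deviation of each run from its own mean."""
    n = len(values)
    # Prefix sums give the cost of any run in constant time.
    s1 = [0.0]
    s2 = [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        # Sum of squared deviations of values[i:j] from their mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting values[:j] into k runs.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    prev = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    prev[k][j] = i

    # Walk back through the table to recover each run's end position.
    ends = []
    j = n
    for k in range(n_segments, 0, -1):
        ends.append(j)
        j = prev[k][j]
    return sorted(ends)


# Hypothetical sorted weights with two obvious jumps.
weights = [0.01, 0.02, 0.03, 0.50, 0.52, 0.55, 2.0, 2.1]
print(segment_l2(weights, 3))  # -> [3, 6, 8]
```

The result lists the end position of each run, with the length of the list last. `ruptures` reports its breakpoints the same way, which is why the cells below read `x[i-1]` to get the last item of each segment.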

In [None]:
def exy():
  # For each of the five clusters: sort its items by model weight, then save
  # the sorted table (gaps<k>.csv) and the plot coordinates (x<k>.txt, y<k>.txt).
  woman = 'What is your gender? - Selected Choice_Woman'
  for k in range(5):
    cluster = dfg4[dfg4['Cluster Assigned'] == k]
    cluster = cluster.sort_values(by=['Model Weights']).reset_index()
    cluster.to_csv('gaps' + str(k + 1) + '.csv')
    y = cluster['Model Weights'].to_numpy()
    x = np.arange(len(y))
    np.savetxt('y' + str(k + 1) + '.txt', y)
    np.savetxt('x' + str(k + 1) + '.txt', x)
    # Record the sorted position of the Woman label when this cluster holds it.
    # np.savetxt needs at least a 1-D array, so the scalar is wrapped in a list.
    vw = cluster.index[cluster['Survey Q & A'] == woman].tolist()
    if vw:
      np.savetxt('fee.txt', [int(vw[0])])
exy()

x0 = np.loadtxt('x1.txt', dtype=float)
x1 = np.loadtxt('x2.txt', dtype=float)
x2 = np.loadtxt('x3.txt', dtype=float)
x3 = np.loadtxt('x4.txt', dtype=float)
x4 = np.loadtxt('x5.txt', dtype=float)
y0= np.loadtxt('y1.txt', dtype=float)
y1 = np.loadtxt('y2.txt', dtype=float)
y2 = np.loadtxt('y3.txt', dtype=float)
y3 = np.loadtxt('y4.txt', dtype=float)
y4 = np.loadtxt('y5.txt', dtype=float)
print_legend=True

def ben():
  # One panel per cluster: the sorted weights as a dashed line, with a grey
  # vertical rule at each detected breakpoint.
  f, axes = plt.subplots(1,5,figsize=(17,2.4))
  xs_all = [x0, x1, x2, x3, x4]
  ys_all = [y0, y1, y2, y3, y4]
  names = ['One', 'Two', 'Three', 'Four', 'Five']
  for k, (ax, xk, yk, name) in enumerate(zip(axes, xs_all, ys_all, names)):
    model.fit(yk)
    breaks = model.predict(n_bkps=n_breaks-1)
    # each breakpoint is a segment's end position, so i-1 is its last item
    breaks_rpt = [xk[i-1] for i in breaks]
    pd.DataFrame(breaks_rpt).to_csv('blecks' + str(k) + '.csv')
    ax.plot(xk, yk, c='k', linestyle='dashed')
    for b in breaks_rpt:
      ax.axvline(b, color='grey')
    ax.axis('off')
    ax.set_title('Cluster ' + name, y=1.1, fontsize=14)
ben()

## ABOVE: We have five clusters, each sliced into five segments. 

### That gives us **25** lists of survey items such as 'what is your age', 'what is your gender' and 'where do you live'. There are **968** items in total, and each of our 25 lists contains a different number of them. 
    
### <span style="background-color: #E1E2FF"> We are betting that the items in each list will be related to each other, since they share both a cluster and a cluster segment.</span>
    
### <span style="background-color: #E1E2FF"> We are also betting that these lists of related items will point toward the human element: interesting Kaggle user groups we can study and talk about.</span>
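As a rough illustration of how one sorted cluster becomes several lists, the sketch below cuts a weight-sorted list of items at a set of segment end positions. The helper `split_by_breaks`, the item names and the end positions are all hypothetical; the notebook itself does this with a pandas merge and back-fill.

```python
def split_by_breaks(sorted_items, run_ends):
    """Cut a weight-sorted list of survey items into consecutive segments.

    `run_ends` holds the end position of each segment; the last entry
    equals len(sorted_items).
    """
    segments = []
    start = 0
    for end in run_ends:
        segments.append(sorted_items[start:end])
        start = end
    return segments


# Hypothetical cluster of seven items, already sorted by model weight.
cluster_items = ["item A", "item B", "item C", "item D",
                 "item E", "item F", "item G"]
run_ends = [2, 4, 7]  # hypothetical end positions for three segments

segments = split_by_breaks(cluster_items, run_ends)
for number, segment in enumerate(segments, start=1):
    print(number, segment)
```

Repeating this for five clusters with five end positions each yields the 25 lists described above.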

# 3. Gender cloud    

We could now use any number of strategies to select clusters for further study. Here we narrow the scope to segments that contain a 'gender' response label. Every label occurs only once in our data, so at most four of the 25 segments will contain a gender label. 
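A small sketch of that selection step, assuming the 25 lists are held in a dictionary keyed by `(cluster, segment)`. That layout, the helper name `segments_with_gender` and the non-gender labels are hypothetical; only the two gender label strings are taken from the notebook's data.

```python
GENDER_QUESTION = "What is your gender?"


def segments_with_gender(segments):
    """Map (cluster, segment) -> gender labels, for segments that hold any."""
    hits = {}
    for key, labels in segments.items():
        found = [label for label in labels
                 if label.startswith(GENDER_QUESTION)]
        if found:
            hits[key] = found
    return hits


# Hypothetical miniature of the 25 lists, keyed by (cluster, segment).
segments = {
    (0, 0): ["Some other question?_Answer A",
             "What is your gender? - Selected Choice_Woman"],
    (0, 1): ["Some other question?_Answer B"],
    (3, 2): ["What is your gender? - Selected Choice_Nonbinary"],
}

selected = segments_with_gender(segments)
print(selected)
```

Because each label occurs only once, a gender label can land in only one segment, so the number of selected segments can never exceed the number of gender labels.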

#### Let's briefly look at some of the **labels** found in the **cluster segments** where **gender labels** appear. But first, please note:
* <span style="background-color: #F9F5AC"> The group we ultimately decide to study is **Women**.</span>
* **Men** were excluded from the keyword-chart analyses that follow. Under the study's current specifications, the label 'Man' is associated only with the labels 'Python', 'Matplotlib', and 'Scikit-learn'. Future work must revisit this.

#### First we examine the segment in which Nonbinary gender responses loaded most strongly

See below: the sorted and segmented cluster is line-plotted and annotated to mark the position of the **Nonbinary** gender label in the data.

In [None]:
def non():

  # Find the cluster that holds the Nonbinary label and record it on disk.
  for i in range(5):
    gapminder_2002 = dfg4[dfg4['Cluster Assigned'] == i]
    vw = gapminder_2002.index[gapminder_2002['Survey Q & A'] == 'What is your gender? - Selected Choice_Nonbinary'].tolist()
    if vw:
      with open('assigned_cluster_indexer.txt', 'w') as f:
        f.write('%d' % (int(vw[0]) + 1))
      with open('assigned_cluster.txt', 'w') as f:
        f.write(str(i))

  # Read the cluster number back and pick the matching global arrays
  # (x0..x4, y0..y4) and sorted-items file (gaps1.csv..gaps5.csv).
  with open('assigned_cluster.txt') as f:
    kawa = f.read()
  kawa4gapper = int(kawa) + 1
  gapper = 'gaps' + str(kawa4gapper) + '.csv'
  xer = globals()['x' + kawa]
  yer = globals()['y' + kawa]

  model.fit(yer)
  breaks3 = model.predict(n_bkps=n_breaks-1)
  breaks_rpt3 = []
  for i in breaks3:
    breaks_rpt3.append(xer[i-1])
  blak=pd.DataFrame(breaks_rpt3)
  awr=pd.read_csv(gapper)
  awq=awr['Survey Q & A'].isin(['What is your gender? - Selected Choice_Woman', 'What is your gender? - Selected Choice_Nonbinary'])
  awq=awq.index[awq == True].tolist()
  if int(len(awq)) >0:
    color='grey'
    #display(' ')#(int(len(awq)))
  else:
    color='grey'
  blak.to_csv('blecks4.csv')
  fig, ax5 = plt.subplots(figsize=(10,2.4))
  
  ax5.plot(xer, yer,c='grey',linestyle='dashed')
  ax5.axis("off")



  vw=awr.index[awr['Survey Q & A'] == 'What is your gender? - Selected Choice_Nonbinary'].tolist()
  vw=int(vw[0])
  rwey=awr['Model Weights']
  rwey=rwey.iloc[vw]

  #ax5.axis('off')
  rweye=round(rwey,6)
  ax5.annotate('NonBinary,  '+str(rweye),
            xy=(vw, rwey),
            xytext=(4,.8),
            #textcoords='offset points',
            arrowprops=dict(arrowstyle='->', color='lightgrey'),
            va='center',
            ha='left',
            fontsize=13)

  #ax5.plot(x1, y1,c='k',linestyle='dashed')


  for i in breaks_rpt3:
      if print_legend:
          ax5.axvline(i, color='lightgrey',ymin=.01,ymax=.5)
      else:
          ax5.axvline(i, color='lightgrey')




    # the bleks spreadsheets are the breaks

  blecks=pd.read_csv('blecks4.csv')

  # so list out 
  breaks_rpt=list(blecks['0'])
  breaks_rpt
  #make a dataframe containing in two rows. 
  breaks_rpt=pd.DataFrame(breaks_rpt)
  breaks_rpt.index=breaks_rpt[0]
  breaks_rpt['goose']=breaks_rpt[0]
  breaks_rpt
  # the gaps spreadsheets are the items
  gapminder_2002=pd.read_csv(gapper)

  #transpose the dataframe and rename the columns
  ad=breaks_rpt.T
  ad
  # Create SECTION's labels in 'T' column
  ad.columns=['1ST', 'SECOND','THIRD', 'FOURTH', 'FIFTH']
  ad=ad.T
  ad['T']=ad.index
  ad.index=ad['goose']
  #display(' ')#(ad)

  # add the labels to each observation of the 'gaps' sheet
  df_merged = gapminder_2002.merge(ad, how='outer', left_index=True, right_index=True)

  # back-fill so every row inherits the name of the segment that ends after it
  df_merged=df_merged.bfill()

  df_merged.head(3)


  # choose the segment to chart; i = 0 selects the first ('1ST') segment
  i=0
  x=['1ST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH'][i]

  # Filter down to only that section
  is_2002 =  df_merged['T']==x


  dummyhere=1111#('Section ', x)

  df_merged = df_merged[is_2002]
  df_merged.head(3)



  df_merged['NAM']=df_merged['Survey Q & A']
  df_merged = df_merged[~df_merged['NAM'].str.contains("_0")]
  df_merged = df_merged[~df_merged['NAM'].str.contains("_None")]
  df_merged = df_merged[~df_merged['NAM'].str.contains(" None")]
  df_merged = df_merged[~df_merged['NAM'].str.contains("_other")]
  #df_merged = df_merged[~df_merged['NAM'].str.contains("prefer to self-describe")]
  df_merged = df_merged[~df_merged['NAM'].str.contains("Other")]
  # regex=False treats '(', ')' and '?' as literal characters, not regex syntax
  df_merged['NAM']=df_merged['NAM'].str.replace('_0', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('_', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('Selected Choice', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace(' - ', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('(Select all that apply)', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('?', '|', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace(':', '|', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('(', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace(')', '', regex=False)



  qor=pd.DataFrame([x.split('|')[:12] for x in df_merged['NAM']])
  qor[-3:]


  try:
    qor['Desired'] = (qor[1].str.split()
                                  .apply(lambda x: OrderedDict.fromkeys(x).keys())
                                  .str.join(' '))
  except: 
    qor['Desired']=qor[1]



  try:
    df_merged=df_merged.reset_index() 
  except:
    pass
  df_merged['lean']=qor['Desired']

  sp=df_merged
  sp[-3:]




  sp['kfx']=sp['Model Weights']
  sp['cha']=sp['kfx'].pct_change() * 1
  sp=sp.bfill()
  sp[:1]['cha']=float(sp[:1]['cha'])*.1+sp[:1]['cha']
  sp=sp.bfill()
  sp[-3:]



  # x and y variables the chart,  a list of the labels for the chart
  sp['x'] = sp['cha'].rank()
  sp['y'] = sp['kfx'].rank()
  val=sp['lean']



  # convert the chart coordinates and labels to numpy arrays
  xs = sp['x'].to_numpy()
  ys = sp['y'].to_numpy()
  val=sp['lean'].to_numpy() 

  # if this segment contains the 'Woman' label, save its keywords
  if sp['lean'].isin(['Woman']).any():
      sp['lean'].to_csv('reu.csv')



  # rename everything for presentation
  sp['topic']=sp['lean']
  sp['weight']=sp['kfx']

  # spii is the label font size used in maro(): it shrinks as the segment
  # grows and is clamped to the range 6-12 points
  spii=min(max(round(2/len(sp)*160), 6), 12)


  boob = sp[sp.columns[sp.columns.isin(['x','y'])]] 

  # k-means on the (x, y) rank coordinates; the cluster ids are used only to
  # color the chart marks in maro()
  mat = boob.values
  km = sklearn.cluster.KMeans(n_clusters=5)
  km.fit(mat)
  labels = km.labels_
  # store each row's cluster id in the 'cluser' column
  results = pd.DataFrame([boob.index,labels]).T
  sp['cluser']=results[1]


  # s = ranked weights ... THIS IS THE WORD SIZE
  s=pd.DataFrame([i**2*2+2 for i in list(sp['cha'])])
  s=s*-10
  s=s.rank()
  s=s*10


  tyui=i#('CLUSTER __  ',  cvrb)
  tyur=x#('SECTION __', x)


  dummyhere=1111#(' Certainly... ... ... ')
  dummyhere=1111#(' Certainly... ... ... ')
  dummyhere=1111#(' Certainly... ... ... ')







  def maro():
    import matplotlib.pyplot as plt
    plt.rcParams['font.family'] = 'serif'
    plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

    fig, ax = plt.subplots(figsize=(10.4,6))
    ax.axis("off")
    cola=[0, 1, 2]
    groups = sp.groupby('cluser')
    colors={0:'SaddleBrown', 1:'Indigo', 2:'DarkOliveGreen', 3:'DarkGoldenrod', 4: 'FireBrick'}
    
    plt.scatter(x=xs,y=ys, s=40,c=sp['cluser'].map(colors))#sp['cluser'].map(colors))#c='#E0DBFF')
    ax.axis("off")


    
    # the titles are set once, outside the label loop
    plt.suptitle('Above: Cluster segment contains label Nonbinary', y=1.1,fontsize=19)
    plt.title('Below: Keyword cloud for cluster segment containing label Nonbinary', y=1.05,fontsize=19)

    # write each keyword just above its point
    for x,y,z in zip(xs,ys,val):
      plt.annotate(z, # the keyword text
                  (x,y), # the point being labelled
                  textcoords="offset points", # read xytext as an offset in points
                  size=spii,
                  xytext=(-1,10), # offset of the text from the point (x,y)
                  ha='center') # horizontal alignment can be left, right or center
                  

  maro()


With the model weights on the y-axis, we'll reuse the charting techniques from the <span style="color:blue">TOPIC MAP</span> earlier, including a fresh k-means analysis that colors the chart marks using the <span style="color:red">CLUSTER COLOR</span> technique. 
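The chart coordinates can be sketched in plain Python as follows. This is a simplified illustration rather than the notebook's pandas code: the helpers `pct_change` and `rank` are hypothetical stand-ins for the pandas methods of the same name, and the sample weights are invented. Each item's x value is the rank of its weight's change from the previous item, and its y value is the rank of the weight itself.

```python
def pct_change(values):
    """Fractional change from the previous value. The first entry has no
    predecessor, so it borrows the second entry's change (a back-fill)."""
    changes = [(b - a) / a for a, b in zip(values, values[1:])]
    return [changes[0]] + changes


def rank(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


# Hypothetical sorted model weights for the items in one segment.
weights = [1.0, 2.0, 2.5, 10.0]

xs = rank(pct_change(weights))  # -> [2.5, 2.5, 1.0, 4.0]
ys = rank(weights)              # -> [1.0, 2.0, 3.0, 4.0]
print(list(zip(xs, ys)))
```

The k-means step then groups these (x, y) points, and each group gets its own mark color.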

In [None]:
non()

#### <span style="background-color: #E1E2FF">Some rather interesting stories in this keyword cloud!</span>

* UPPER LEFT: Kagglers who are Statisticians, Project Managers and Public Service workers holding High School diplomas in Russia and the Middle East. 

* LOWER LEFT: Developer Relations and Advocacy work in Thailand and Ireland sounds like an interesting subject for Journalism and Social Science research.

* Military/Security/Defense in... 

**...all very interesting.**

In [None]:
# -  Woman

def women():

  # Find the cluster that holds the Woman label and record it on disk.
  for i in range(5):
    gapminder_2002 = dfg4[dfg4['Cluster Assigned'] == i]
    vw = gapminder_2002.index[gapminder_2002['Survey Q & A'] == 'What is your gender? - Selected Choice_Woman'].tolist()
    if vw:
      with open('assigned_cluster_indexer.txt', 'w') as f:
        f.write('%d' % (int(vw[0]) + 1))
      with open('assigned_cluster.txt', 'w') as f:
        f.write(str(i))

  # Read the cluster number back and pick the matching global arrays
  # (x0..x4, y0..y4) and sorted-items file (gaps1.csv..gaps5.csv).
  with open('assigned_cluster.txt') as f:
    kawa = f.read()
  kawa4gapper = int(kawa) + 1
  gapper = 'gaps' + str(kawa4gapper) + '.csv'
  xer = globals()['x' + kawa]
  yer = globals()['y' + kawa]

  model.fit(yer)
  breaks3 = model.predict(n_bkps=n_breaks-1)
  breaks_rpt3 = []
  for i in breaks3:
    breaks_rpt3.append(xer[i-1])
  blak=pd.DataFrame(breaks_rpt3)
  awr=pd.read_csv(gapper)
  awq=awr['Survey Q & A'].isin(['What is your gender? - Selected Choice_Woman', 'What is your gender? - Selected Choice_Woman'])
  awq=awq.index[awq == True].tolist()
  if int(len(awq)) >0:
    color='grey'
    #display(' ')#(int(len(awq)))
  else:
    color='grey'
  blak.to_csv('blecks4.csv')
  fig, ax5 = plt.subplots(figsize=(10,2.4))
  
  ax5.plot(xer, yer,c='grey',linestyle='dashed')
  ax5.axis("off")



  vw=awr.index[awr['Survey Q & A'] == 'What is your gender? - Selected Choice_Woman'].tolist()
  vw=int(vw[0])
  rwey=awr['Model Weights']
  rwey=rwey.iloc[vw]

  #ax5.axis('off')
  rweye=round(rwey,2)
  ax5.annotate('♀,  '+str(rweye),
            xy=(vw, rwey),
            xytext=(4,.8),
            #textcoords='offset points',
            arrowprops=dict(arrowstyle='->', color='lightgrey'),
            va='center',
            ha='left',
            fontsize=13)

  #ax5.plot(x1, y1,c='k',linestyle='dashed')


  for i in breaks_rpt3:
      if print_legend:
          ax5.axvline(i, color='lightgrey',ymin=.01,ymax=.5)
      else:
          ax5.axvline(i, color='lightgrey')




    # the bleks spreadsheets are the breaks

  blecks=pd.read_csv('blecks4.csv')

  # so list out 
  breaks_rpt=list(blecks['0'])
  breaks_rpt
  #make a dataframe containing in two rows. 
  breaks_rpt=pd.DataFrame(breaks_rpt)
  breaks_rpt.index=breaks_rpt[0]
  breaks_rpt['goose']=breaks_rpt[0]
  breaks_rpt
  # the gaps spreadsheets are the items
  gapminder_2002=pd.read_csv(gapper)

  #transpose the dataframe and rename the columns
  ad=breaks_rpt.T
  ad
  # Create SECTION's labels in 'T' column
  ad.columns=['1ST', 'SECOND','THIRD', 'FOURTH', 'FIFTH']
  ad=ad.T
  ad['T']=ad.index
  ad.index=ad['goose']
  #display(' ')#(ad)

  # add the labels to each observation of the 'gaps' sheet
  df_merged = gapminder_2002.merge(ad, how='outer', left_index=True, right_index=True)

  # back-fill so every row inherits the name of the segment that ends after it
  df_merged=df_merged.bfill()

  df_merged.head(3)


  # choose the segment to chart; i = 0 selects the first ('1ST') segment
  i=0
  x=['1ST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH'][i]

  # Filter down to only that section
  is_2002 =  df_merged['T']==x


  dummyhere=1111#('Section ', x)

  df_merged = df_merged[is_2002]
  df_merged.head(3)



  df_merged['NAM']=df_merged['Survey Q & A']
  df_merged = df_merged[~df_merged['NAM'].str.contains("_0")]
  df_merged = df_merged[~df_merged['NAM'].str.contains("_None")]
  df_merged = df_merged[~df_merged['NAM'].str.contains(" None")]
  df_merged = df_merged[~df_merged['NAM'].str.contains("_other")]
  #df_merged = df_merged[~df_merged['NAM'].str.contains("prefer to self-describe")]
  df_merged = df_merged[~df_merged['NAM'].str.contains("Other")]
  # regex=False treats '(', ')' and '?' as literal characters, not regex syntax
  df_merged['NAM']=df_merged['NAM'].str.replace('_0', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('_', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('Selected Choice', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace(' - ', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('(Select all that apply)', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('?', '|', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace(':', '|', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace('(', '', regex=False)
  df_merged['NAM']=df_merged['NAM'].str.replace(')', '', regex=False)



  qor=pd.DataFrame([x.split('|')[:12] for x in df_merged['NAM']])
  qor[-3:]


  try:
    qor['Desired'] = (qor[1].str.split()
                                  .apply(lambda x: OrderedDict.fromkeys(x).keys())
                                  .str.join(' '))
  except: 
    qor['Desired']=qor[1]



  try:
    df_merged=df_merged.reset_index() 
  except:
    pass
  df_merged['lean']=qor['Desired']

  sp=df_merged
  sp[-3:]




  sp['kfx']=sp['Model Weights']
  sp['cha']=sp['kfx'].pct_change() * 1
  sp=sp.bfill()
  sp[:1]['cha']=float(sp[:1]['cha'])*.1+sp[:1]['cha']
  sp=sp.bfill()
  sp[-3:]



  # x and y variables the chart,  a list of the labels for the chart
  sp['x'] = sp['cha'].rank()
  sp['y'] = sp['kfx'].rank()
  val=sp['lean']



  # convert the chart coordinates and labels to numpy arrays
  xs = sp['x'].to_numpy()
  ys = sp['y'].to_numpy()
  val=sp['lean'].to_numpy() 

  # if this segment contains the 'Woman' label, save its keywords
  if sp['lean'].isin(['Woman']).any():
      sp['lean'].to_csv('reu.csv')



  # rename everything for presentation
  sp['topic']=sp['lean']
  sp['weight']=sp['kfx']

  # spii is the label font size used in maro(): it shrinks as the segment
  # grows and is clamped to the range 6-12 points
  spii=min(max(round(2/len(sp)*160), 6), 12)


  boob = sp[sp.columns[sp.columns.isin(['x','y'])]] 

  # k-means on the (x, y) rank coordinates; the cluster ids are used only to
  # color the chart marks in maro()
  mat = boob.values
  km = sklearn.cluster.KMeans(n_clusters=5)
  km.fit(mat)
  labels = km.labels_
  # store each row's cluster id in the 'cluser' column
  results = pd.DataFrame([boob.index,labels]).T
  sp['cluser']=results[1]


  # s = ranked weights ... THIS IS THE WORD SIZE
  s=pd.DataFrame([i**2*2+2 for i in list(sp['cha'])])
  s=s*-10
  s=s.rank()
  s=s*10


  #tyui=('CLUSTER __  ',  cvrb)
  tyur=('SECTION __', x)







  def maro():
    import matplotlib.pyplot as plt
    plt.rcParams['font.family'] = 'serif'
    plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

    fig, ax = plt.subplots(figsize=(10.4,6))
    ax.axis("off")
    cola=[0, 1, 2]
    groups = sp.groupby('cluser')
    colors={0:'SaddleBrown', 1:'Indigo', 2:'DarkOliveGreen', 3:'DarkGoldenrod', 4: 'FireBrick'}
    
    plt.scatter(x=xs,y=ys, s=40,c=sp['cluser'].map(colors))#sp['cluser'].map(colors))#c='#E0DBFF')
    ax.axis("off")


    
    for x,y,z in zip(xs,ys,val):
      label = z
      
  #for name, group in groups:
  #    plt.plot(group["X Value"], group["Y Value"], marker="o", linestyle="", label=name)
      # font from OS
    

      #plt.title("Topics", 
    #           loc='Center', fontsize=26, **5hfont)
    
    
      plt.suptitle('Above: Cluster segment contains label Woman', y=1.1,fontsize=19)
      plt.title('Below: Keyword cloud for cluster segment containing label Woman', y=1.05,fontsize=19)

 



      tyur
      plt.annotate(label, # this is the text
                  (x,y), # these are the coordinates to position the label
                  textcoords="offset points", # how to position the text
                  size=spii,
                  xytext=(-1,10), # distance from text to points (x,y)
                  ha='center') # horizontal alignment can be left, right or center
                  

  maro()


#### Now, in the segment containing the Woman label, we sort and segment the cluster weights as above and line-plot them, marking where the Woman gender label falls in the data.

Again using the Model Weights as y-axis values, we apply the charting techniques we used on the <span style="color:blue">TOPIC MAP</span>, including a brand new k-means analysis to color the chart marks using the <span style="color:red">CLUSTER COLOR</span> technique.
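
The cluster-coloring step can be sketched in isolation. The snippet below is a minimal illustration, not the notebook's `women()` function: the weights are made-up numbers, and `toy`, `coords`, and `palette` are hypothetical names. It ranks the weights and their percentage change to get x/y positions, fits k-means on those ranks, and maps each cluster label to a color.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up weights standing in for the 'Model Weights' column (assumption).
toy = pd.DataFrame({"weight": [0.90, 0.70, 0.65, 0.40, 0.38, 0.20, 0.10, 0.05]})

# Percentage change between consecutive weights; the first row has no
# predecessor, so it is back-filled from the second row.
toy["change"] = toy["weight"].pct_change().bfill()

# Rank both series so the points spread evenly along each axis.
coords = pd.DataFrame({"x": toy["change"].rank(), "y": toy["weight"].rank()})

# Fit k-means on the ranked coordinates; random_state makes the run repeatable.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords.to_numpy())
toy["cluster"] = km.labels_

# Map each cluster label to a named color that a scatter plot can consume.
palette = {0: "SaddleBrown", 1: "Indigo", 2: "DarkOliveGreen"}
toy["color"] = toy["cluster"].map(palette)
print(toy)
```

Passing `toy["color"]` as the `c` argument of `plt.scatter` would then color each mark by its cluster, which is roughly what the chart function does with its five-color palette.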

In [None]:
women()

### The word cloud above shows young women, lower education, heavier use of four specific programming languages, and residence within a **specific area**.

#### The map below shows just how specific a geography we are in now.

In [None]:
import matplotlib.pyplot as plt
import geopandas as gpd
# from descartes import PolygonPatch

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# plot the world outline as the base layer
ax2 = world.plot(figsize=(8,8), edgecolor='lightgrey', color='white')

world[world.name == "Iran"].plot(edgecolor=u'gray', color='k', ax=ax2)
world[world.name == "Algeria"].plot(edgecolor=u'gray', color='k', ax=ax2)
world[world.name == "Tunisia"].plot(edgecolor=u'gray', color='k', ax=ax2)
world[world.name == "Morocco"].plot(edgecolor=u'gray', color='k', ax=ax2)
world[world.name == "China"].plot(edgecolor=u'grey', color='k', ax=ax2)

# the place to plot additional vector data (points, lines)

#plt.ylabel('Latitude')
#plt.xlabel('Longitude')

ax2.axis('off')
plt.show()


## <span style="background-color: #F9F5AC">Now this feels more close-up to people. It is the most personal result we've seen so far, and it seems appropriate to take this as a starting point for our user group.</span>

## Fine. Young women, current and former university students, a specific geography, three or four programming languages. <span style="background-color: #F9F5AC">Is that a story?</span>

### No, but maybe we are closer...

Alas, when we look for more detail in the Kaggle data by filtering on gender, education, and country of residence to match our cluster, we almost run out of data to analyze.
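
How quickly stacked filters shrink a sample can be sketched on a toy table. Everything below is illustrative: the six rows are invented, and the short column names (`gender`, `country`, `education`) are hypothetical stand-ins for the long survey question strings. Like the notebook cell that follows, the gender filter keeps every row that is not `Man`.

```python
import pandas as pd

# Tiny made-up sample; column names are shortened stand-ins (assumption).
responses = pd.DataFrame({
    "gender":    ["Woman", "Man", "Woman", "Woman", "Man", "Woman"],
    "country":   ["Morocco", "China", "India", "Tunisia", "Iran", "China"],
    "education": ["Some college", "Master", "Some college",
                  "Some college", "Some college", "Doctoral"],
})

target_countries = ["Morocco", "China", "Algeria", "Tunisia", "Iran"]

# One boolean condition per filter, applied cumulatively in this order.
steps = {
    "all responses": pd.Series(True, index=responses.index),
    "gender": responses["gender"] != "Man",
    "country": responses["country"].isin(target_countries),
    "education": responses["education"] == "Some college",
}

mask = pd.Series(True, index=responses.index)
remaining = {}
for name, condition in steps.items():
    mask = mask & condition          # keep rows that pass every filter so far
    remaining[name] = int(mask.sum())

print(remaining)
```

Recording the count after each step makes it easy to see which filter removes the most rows before deciding whether the remaining sample is still usable.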

In [None]:
original_emport=pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
df=original_emport.fillna(0)
new_header = df.iloc[0] 
df = df[1:] 
df.columns = new_header
dfco=df['Duration (in seconds)']
df=df.astype('str')
df.astype('str')
dat1=df
dfx=pd.DataFrame(dat1.columns.to_list())
dfx=dfx.T
df.shape
df['What is your age (# years)?'] = df['What is your age (# years)?'].map(lambda x: x.lstrip('+').rstrip('+'))
df['What is your age (# years)?']=df['What is your age (# years)?'].str.replace('70', '70-80')
df[['A', 'B']] = df['What is your age (# years)?'].str.split('-', n=1, expand=True)  # n must be passed by keyword in pandas 2.x
df['A']=df['A'].astype('int')
df['B']=df['B'].astype('int')
df['age']=((df['A']+df['B'])/2)

df

man = df['What is your gender? - Selected Choice']!='Man'
df=df[man]
df



years = ['Morocco', 'China', 'Algeria', 'Tunisia', 'Iran, Islamic Republic of...']
years=df['In which country do you currently reside?'].isin(years)
df=df[years]
uslet=df
#df

#df['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].unique().tolist()




is_2002 =  df['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'] == 'Some college/university study without earning a bachelor’s degree'

df=df[is_2002]
#is_2002=df['What is your age (# years)?']=='18-21'
#df=df[is_2002]
#df


#df.head()

df['yeei']=df.index
glio=df['yeei']
#glio

#### Using the same charting method as earlier (a matrix of 1's and 0's), we make the chart below. The unshaded portion of the chart is our sample at this stage, and we are not even done filtering on labels. Note again that the chart's y-axis covers only 500 observations out of the more than 25,000 we could plot.
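
The idea behind this chart can be sketched on a toy table. This is a simplified variant under stated assumptions, not the cell below: the four rows are invented, `selected` is a hypothetical list of row positions, and the selected rows are simply zeroed rather than recoded the way the notebook does it.

```python
import pandas as pd

# Made-up answers for four respondents (assumption).
answers = pd.DataFrame({
    "gender":  ["Woman", "Man", "Man", "Woman"],
    "country": ["China", "India", "China", "Morocco"],
})

# One-hot encode: every (question, answer) pair becomes a 0/1 column.
encoded = pd.get_dummies(answers).astype(int)

# Hypothetical row positions of the filtered sample.
selected = [0, 3]

# Blank out the selected rows and move them to the top, so they form one
# contiguous band when the matrix is drawn as an image.
encoded.iloc[selected] = 0
order = selected + [i for i in range(len(encoded)) if i not in selected]
matrix = encoded.iloc[order].to_numpy()
print(matrix)
```

Calling `plt.imshow(matrix, cmap='bone', interpolation='nearest')` on the result would draw the band of selected rows above the rest of the sample.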

In [None]:


# list of iloc

glio=list(glio)


#dfw=pd.read_csv('/content/drive/MyDrive/kaggle_survey_2021_responses.csv')
dfw=original_emport.fillna(0)
new_header = dfw.iloc[0] 
dfw = dfw[1:] 
dfw.columns = new_header
dfcow=dfw['Duration (in seconds)']
dfw=dfw.drop(['Duration (in seconds)'], axis=1)
dfw=dfw.astype('str')
dfw=pd.get_dummies(dfw)
dfw.astype('float')





for i in glio:
  dfw.iloc[i]=dfw.iloc[i]+1
  dfw.iloc[i]=dfw.iloc[i].replace(1,0)
  dfw = pd.concat([dfw.iloc[[i],:], dfw.drop(i, axis=0)], axis=0)

  #print(dfw.iloc[i])


#dfw.head()









zsl=dfw[:500]
zs=zsl.values#dfw.values

# IMAGE

from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,20)
#plt.subplot(211)
plt.imshow(zs)
#plt.subplot(212)
plt.imshow(zs, cmap='bone',  interpolation='nearest')
plt.axis('off')
plt.rcParams["figure.figsize"] = (20,20)
#plt.show()

## That tiny sample just won't do. Are we losing momentum?

### Maybe not. Let's follow the thread out of here for a moment and into external datasets for clues and inspiration.




# 4. The four languages

## Why C, C++, Java, and Javascript? 

### <span style="background-color: #F9F5AC"> It may be notable that C, C++, Java, and Javascript are the only languages loading in the same segment as 'Woman'.</span>

See the charts below, where we find geographic associations between **rates of use for (only) these four programming languages** (Kaggle sample) and two **global indicators** from the [World Bank](https://www.kaggle.com/mutindafestus/world-statistics-dataset-from-world-bank): **GDP per capita** and **female employment in agriculture as a percent of female employment, by country**. <span style="background-color: #F9F5AC"> The **use rates** of nearly all 6 other **coding languages** show either inverted or flat associations.</span>
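
A "use rate" here is the share of a country's respondents who selected a language. A minimal sketch of that aggregation follows, assuming invented respondents and invented indicator values (not World Bank data). The names `survey`, `indicator`, and `points` are hypothetical.

```python
import pandas as pd

# Made-up respondents: 1 means the respondent selected the language.
survey = pd.DataFrame({
    "country": ["Xland", "Xland", "Xland", "Yland", "Yland",
                "Zland", "Zland", "Zland", "Zland"],
    "C":       [1, 0, 1, 0, 0, 1, 1, 1, 0],
    "Python":  [1, 1, 0, 1, 1, 1, 0, 1, 1],
})

# Made-up country-level indicator (hypothetical values).
indicator = pd.DataFrame({
    "country": ["Xland", "Yland", "Zland"],
    "gdp_per_capita": [4000.0, 30000.0, 2500.0],
})

# The mean of a 0/1 column within a country is that country's use rate.
use_rates = survey.groupby("country")[["C", "Python"]].mean()

# Attach the indicator so each country becomes one (use rate, indicator) point.
points = use_rates.merge(indicator.set_index("country"),
                         left_index=True, right_index=True)
print(points)
```

Each row of `points` is one dot in a scatter panel: the language's use rate on the x-axis and the indicator on the y-axis.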

## Coding Language use rates  X  % of female employment in agriculture

In [None]:
from IPython.display import clear_output




def goah(thver):
    import pandas as pd
    df=pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv").fillna(0)
    new_header = df.iloc[0] 
    df = df[1:] 
    df.columns = new_header
    dfco=df['Duration (in seconds)']
    df=df.astype('str')
    df.astype('str')
    dat1=df
    dfx=pd.DataFrame(dat1.columns.to_list())
    dfx=dfx.T
    df['What is your age (# years)?'] = df['What is your age (# years)?'].map(lambda x: x.lstrip('+').rstrip('+'))
    df['What is your age (# years)?']=df['What is your age (# years)?'].str.replace('70', '70-80')
    df[['A', 'B']] = df['What is your age (# years)?'].str.split('-', n=1, expand=True)  # n must be passed by keyword in pandas 2.x
    df['A']=df['A'].astype('int')
    df['B']=df['B'].astype('int')
    df['age']=((df['A']+df['B'])/2)
    df['country']=df['In which country do you currently reside?']
    boop=list(df['country'])
    import country_converter as coco
    some_names =boop
    standard_names = coco.convert(names=some_names, to='name_short')
    #print(standard_names)
    df['trucou']=standard_names
    #df.index=df['trucou']
    connectthisone=df
    #df.head(60)
    df=1
    worlddataimport=pd.read_csv('../input/world-statistics-dataset-from-world-bank/data.csv')
    df=worlddataimport
    df['year']=df['year'].astype('str')
    years = ['2019']
    years=df['year'].isin(years)
    #years
    df=df[years]
    df=df.dropna(axis=1, how='all')
    #df
    #Country Name
    boop=list(df['Country Name'])
    some_names =boop
    standard_names = coco.convert(names=some_names, to='name_short')
    #print(standard_names)
    df['crucoun']=standard_names
    #df.index=df['trucou']
    #connectthisone=df
    #df.head(10)
    #df
    connectthistwo=df
    df1=connectthisone
    df2=connectthistwo
    langs=['Python', 'R','SQL' ,'C','C++','Java','Javascript' ,'Julia','Swift','Bash','MATLAB']
    #langs
    appear=langs
    # merge, using 'outer' to avoid losing records from either left or right
    df3 = pd.merge(left=df1, right=df2, left_on='trucou', right_on='crucoun', how='outer')
    # combining the columns used to match
    df3['trucou'] = df3.apply(lambda row: row['trucou'] if not pd.isnull(row['trucou']) else row['crucoun'], axis=1)
    # dropping the now spare column
    #df3 = df3.drop('crucoun', axis=1)
    # keep only the first len(df1) rows of the merged frame
    df3=df3[:int(len(df1))]
    #df3['op']=df3['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - C']

    for i in range(int(len(langs))):

      apper=langs[i]

      df3[apper]=df3['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - '+ apper]
      df3[apper]=df3[apper].replace([apper], [1])
      df3[apper]=df3[apper].astype('int')

    df3['nmer']= df3['Employment in agriculture (% of total employment) (modeled ILO estimate)']
    #df3
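    # average only the numeric columns per country; pandas 2.x raises on text columns otherwise
    dst=df3.groupby(df3['trucou']).mean(numeric_only=True)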
    import matplotlib
    from matplotlib.pyplot import figure as fig
    from pylab import rcParams
    #rcParams['figure.figsize'] = 9, 9
    #plt.style.use('dark_background')
    #dst
    #https://towardsdatascience.com/bring-your-jupyter-notebook-to-life-with-interactive-widgets-bc12e03f0916
    # List of options
    state_options = ['','Agricultural raw materials exports (% of merchandise exports)', 'Agricultural raw materials imports (% of merchandise imports)', 'Agriculture, forestry, and fishing, value added (% of GDP)', 'Agriculture, forestry, and fishing, value added (current US$)', 'Employment in agriculture (% of total employment) (modeled ILO estimate)', 'Employment in agriculture, female (% of female employment) (modeled ILO estimate)', 'Employment in agriculture, male (% of male employment) (modeled ILO estimate)', 'GDP per capita (current US$)', 'Literacy rate, adult total (% of people ages 15 and above)', 'Mortality rate, infant (per 1,000 live births)']
    #thver= 'Employment in agriculture, female (% of female employment) (modeled ILO estimate)'
    #thver
    varstr=thver
    y=dst[varstr]
    x1=dst['Python']
    x2=dst['R']
    x3=dst['SQL']
    x4=dst['C']
    x5=dst['C++']
    x6=dst['Java']
    x7=dst['Javascript']
    x8=dst['Julia']
    x9=dst['Swift']
    x10=dst['Bash']
    #x11=dst['MATLAB']

    plt.tight_layout
    clear_output(wait=True)


    def ben():
      
      plt.rcParams['font.family'] = 'serif'
      plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']
      f, (ax1, ax2,ax3,ax4,ax5) = plt.subplots(1,5,figsize=(15,3.0), constrained_layout=True, sharey=True)
      f.tight_layout(pad=3.0)
      f.suptitle('x-Axis = % language users | y-Axis = Variable: '+thver , fontsize=15,y=1.3)
    
      print('1')
      ax1.scatter(x1, y, c='w', s=6, edgecolor='grey')
      ax1.set_title('Python', size=19)
      ax2.scatter(x2, y, c='w', s=6, edgecolor='grey')
      ax2.set_title('R', size=19)
      ax3.scatter(x3, y, c='w', s=6,  edgecolor='grey')
      ax3.set_title('SQL', size=19)
      ax4.scatter(x4, y, c='w',  s=6, edgecolor='grey')
      ax4.set_title('C', size=19)
      ax5.scatter(x5, y, c='w',  s=6, edgecolor='grey')
      ax5.set_title('C++', size=19)
      print('2')

      N = len(x1)
      x1_mean = x1.mean()
      y_mean = y.mean()
      B1_num = ((x1 - x1_mean) * (y - y_mean)).sum()
      B1_den = ((x1 - x1_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x1_mean)
      ax1.plot(x1, B0 + B1*x1,  c='k', linewidth=.5, alpha=.9, solid_capstyle='round')


      N = len(x2)
      x2_mean = x2.mean()
      y_mean = y.mean()
      B1_num = ((x2 - x2_mean) * (y - y_mean)).sum()
      B1_den = ((x2 - x2_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x2_mean)
      ax2.plot(x2, B0 + B1*x2,  c='k', linewidth=.5, alpha=.9, solid_capstyle='round')


      N = len(x3)
      x3_mean = x3.mean()
      y_mean = y.mean()
      B1_num = ((x3 - x3_mean) * (y - y_mean)).sum()
      B1_den = ((x3 - x3_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x3_mean)
      ax3.plot(x3, B0 + B1*x3,  c='k', linewidth=.5, alpha=.9, solid_capstyle='round')

      N = len(x4)
      x4_mean = x4.mean()
      y_mean = y.mean()
      B1_num = ((x4 - x4_mean) * (y - y_mean)).sum()
      B1_den = ((x4 - x4_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x4_mean)
      ax4.plot(x4, B0 + B1*x4, linewidth=.5, alpha=.9, solid_capstyle='round')

      N = len(x5)
      x5_mean = x5.mean()
      y_mean = y.mean()
      B1_num = ((x5 - x5_mean) * (y - y_mean)).sum()
      B1_den = ((x5 - x5_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x5_mean)
      ax5.plot(x5, B0 + B1*x5, linewidth=.5, alpha=.9, solid_capstyle='round')

      print('3')
      ax1.spines['right'].set_visible(False)
      ax2.spines['right'].set_visible(False)
      ax3.spines['right'].set_visible(False)
      ax4.spines['right'].set_visible(False)
      ax5.spines['right'].set_visible(False)
      ax1.spines['left'].set_visible(False)
      ax2.spines['left'].set_visible(False)
      ax3.spines['left'].set_visible(False)
      ax4.spines['left'].set_visible(False)
      ax5.spines['left'].set_visible(False)
      ax1.spines['bottom'].set_visible(False)
      ax2.spines['bottom'].set_visible(False)
      ax3.spines['bottom'].set_visible(False)
      ax4.spines['bottom'].set_visible(False)
      ax5.spines['bottom'].set_visible(False)


      print('4')
      clear_output(wait=True)
    def duniy():

      plt.rcParams['font.family'] = 'serif'
      plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

      f, (ax6,ax7,ax8,ax9,ax10) = plt.subplots(1,5,figsize=(15,3), constrained_layout=True, sharey=True)
      f.tight_layout(pad=3.0)

      ax6.scatter(x6, y, c='w',  s=11, edgecolor='grey')
      ax6.set_title('Java', size=19)
      #ax6.axes.yaxis.set_visible(False)
      ax7.scatter(x7, y, c='w',  s=11, edgecolor='grey')
      ax7.set_title('Javascript', size=19)
      ax8.scatter(x8, y, c='w',  s=11, edgecolor='grey')
      ax8.set_title('Julia', size=19)
      ax9.scatter(x9, y, c='w',  s=11, edgecolor='grey')
      ax9.set_title('Swift', size=19)
      ax10.scatter(x10, y, c='w',  s=11, edgecolor='grey')
      ax10.set_title('Bash', size=19)
      #ax11.scatter(x11, y, c='w',  s=6, edgecolor='w')
      #ax11.set_title('% MATLAB', size=12)


      N = len(x6)
      x6_mean = x6.mean()
      y_mean = y.mean()
      B1_num = ((x6 - x6_mean) * (y - y_mean)).sum()
      B1_den = ((x6 - x6_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x6_mean)
      ax6.plot(x6, B0 + B1*x6, linewidth=.5, alpha=.9, solid_capstyle='round')

      N = len(x7)
      x7_mean = x7.mean()
      y_mean = y.mean()
      B1_num = ((x7 - x7_mean) * (y - y_mean)).sum()
      B1_den = ((x7 - x7_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x7_mean)
      ax7.plot(x7, B0 + B1*x7, linewidth=.5, alpha=.9, solid_capstyle='round')


      N = len(x8)
      x8_mean = x8.mean()
      y_mean = y.mean()
      B1_num = ((x8 - x8_mean) * (y - y_mean)).sum()
      B1_den = ((x8 - x8_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x8_mean)
      ax8.plot(x8, B0 + B1*x8,  c='k', linewidth=.5, alpha=.9, solid_capstyle='round')


      N = len(x9)
      x9_mean = x9.mean()
      y_mean = y.mean()
      B1_num = ((x9 - x9_mean) * (y - y_mean)).sum()
      B1_den = ((x9 - x9_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x9_mean)
      ax9.plot(x9, B0 + B1*x9, c='k', linewidth=.5, alpha=.9, solid_capstyle='round')

      N = len(x10)
      x10_mean = x10.mean()
      y_mean = y.mean()
      B1_num = ((x10 - x10_mean) * (y - y_mean)).sum()
      B1_den = ((x10 - x10_mean)**2).sum()
      B1 = B1_num / B1_den
      B0 = y_mean - (B1*x10_mean)
      ax10.plot(x10, B0 + B1*x10, c='k',linewidth=.5, alpha=.9, solid_capstyle='round')

      ax6.spines['right'].set_visible(False)
      ax7.spines['right'].set_visible(False)
      ax8.spines['right'].set_visible(False)
      ax9.spines['right'].set_visible(False)
      ax10.spines['right'].set_visible(False)
      ax6.spines['left'].set_visible(False)
      ax7.spines['left'].set_visible(False)
      ax8.spines['left'].set_visible(False)
      ax9.spines['left'].set_visible(False)
      ax10.spines['left'].set_visible(False)
      ax6.spines['bottom'].set_visible(False)
      ax7.spines['bottom'].set_visible(False)
        
      ax8.spines['bottom'].set_visible(False)
      ax9.spines['bottom'].set_visible(False)
      ax10.spines['bottom'].set_visible(False)
    
      
    clear_output(wait=True)
    display(ben())
    clear_output(wait=True)
    
    duniy()
goah('Employment in agriculture, female (% of female employment) (modeled ILO estimate)')

## Coding Language use rates  X  GDP by country 

In [None]:
goah('GDP per capita (current US$)')

### <span style="background-color: #F9F5AC"> Among Kaggle users, higher rates of use of these four languages are associated with residence in lower-GDP countries, and in countries with higher rates of female employment in agriculture.</span>

### <span style="background-color: #F9F5AC"> Among all languages in the dataset, only these four seem to track these discrepancies.</span>
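
The trend line in each panel is an ordinary least-squares fit, with slope $B_1 = \sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$ and intercept $B_0 = \bar{y} - B_1 \bar{x}$. The sign of the slope is what separates the four languages from the rest. The helper below is a small sketch of that calculation in plain Python. `fit_line` is a hypothetical name, and the sample points are invented so that they lie exactly on a known line.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Invented points lying exactly on y = 50 - 40x (assumption).
use_rate = [0.0, 0.25, 0.5, 1.0]
indicator = [50.0, 40.0, 30.0, 10.0]
b0, b1 = fit_line(use_rate, indicator)
print(b0, b1)
```

A negative slope against GDP per capita together with a positive slope against female employment in agriculture is the pattern described above. A slope near zero corresponds to a flat association.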

The evidence is directional, certainly. Still, how's this for a developing story: 

#### <span style="background-color: #E1E2FF"> We might have here a blurry image of a small number of women scattered throughout a vast but known geography, similar though they have likely never met. They use the four programming languages discussed above the most. Lower GDP means a higher likelihood of poverty, and personal connections to agricultural workers are more likely for them.</span>

Are these the Kaggle users of today? Yes, though they are only a tiny sliver of them.

#### <span style="background-color: #E1E2FF"> And the Kaggle users of tomorrow? Perhaps this analysis is directing us toward where they might come from, and the programming language preferences they might arrive with. Check back.</span>

# 5. A conclusion inside cities

Let's now compare [Google search](https://trends.google.com/trends/?geo=US) rates for **coding languages** in 2020-2021 with [average rents and disposable incomes](https://www.kaggle.com/blitzr/movehub-city-rankings) in 37 cities worldwide. 

I have consolidated the data and added latitude and longitude columns, [access this dataset here](https://www.kaggle.com/daanderson/programming-language-search-trends-by-city-2021). Plotting:

In [None]:
import pandas as pd

import matplotlib.pyplot as plt

import geopandas as gpd
from shapely.geometry import Point

import folium
from folium import Marker, GeoJson
from folium.plugins import MarkerCluster, HeatMap

wc = gpd.read_file("../input/programming-language-search-trends-by-city-2021/xworddata.csv")
import pandas as pd
from shapely.geometry import Point
import geopandas
from geopandas import GeoDataFrame
import geoplot

df = wc
df['Long']=df['Long'].astype('float')
df['Lat']=df['Lat'].astype('float')


geometry = [Point(xy) for xy in zip(df['Long'], df['Lat'])]
gdf = GeoDataFrame(df, geometry=geometry)   

#this is a simple map that goes with geopandas
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

gdf.plot(ax=world.plot(color='white', edgecolor='lightgrey',figsize=(10, 6)), marker='o', color='k', markersize=15);



**Now** we run another of our "simple" k-means analyses on the coding-language search rates for the cities in the dataset. Then we plot average rent against average disposable income by city, and color each dot according to the cluster the city was assigned in the k-means analysis.
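
The clustering step can be sketched on a toy table. This is a simplified illustration, not the cell below: the six cities and their scores are invented, and the ranking is done once per column rather than with the notebook's double rank-and-transpose.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up search-interest scores for six hypothetical cities.
cities = pd.DataFrame(
    {"Python": [90, 85, 40, 35, 60, 55],
     "Java":   [30, 35, 80, 85, 50, 55],
     "C":      [10, 15, 70, 75, 40, 45]},
    index=["City1", "City2", "City3", "City4", "City5", "City6"],
)

# Rank within each column, then standardize to zero mean and unit variance.
ranked = cities.rank()
scaled = StandardScaler().fit_transform(ranked.to_numpy())

# Cluster on the language columns only; no wealth variable is involved.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = pd.Series(km.fit_predict(scaled), index=cities.index, name="cluster")
print(labels)
```

Because rent and income never enter the clustering, any alignment between cluster colors and the wealth axes in the scatter comes from the data rather than from the method.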

In [None]:


df = pd.read_csv("../input/programming-language-search-trends-by-city-2021/xworddata.csv", low_memory=False)
belor=df['City']
#df=df.drop(['City', ], axis=1)
df=df.drop(['Pop','City', 'Long', 'Lat', 'Movehub Rating', 'Purchase Power', 'Python: (11/18/20 - 11/18/21).1', 'Price of a Coffee', 'Price of a Movie', 'Wine Price', 'Gas Price', 'Health Care','Pollution','Quality of Life','Crime Rating','Avg Rent','Avg Disposable Income'], axis=1)
df=df.astype('int')
deeper=df
df=df.T
df=df.rank()
df=df.T
df=df.rank()

#df['City']=belor
#df.T
df
blist=list(df.columns)
from sklearn.preprocessing import StandardScaler
features = blist
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
df['City']=belor
y = df.loc[:,['City']].values
#y = belor.values
# Standardizing the features
x = StandardScaler().fit_transform(x)

import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler



kmeans = KMeans(
    init="random",
    n_clusters=5,
    #n_init=10,
    n_init=30000,
    #max_iter=30,
    max_iter=5000000,
    random_state=42
    )

zx=kmeans.fit(x)
df = pd.read_csv("../input/programming-language-search-trends-by-city-2021/xworddata.csv", low_memory=False)

bx=zx.labels_  # reuse the labels from the fit above; fit_predict would repeat the same expensive fit

breex=pd.DataFrame(bx,columns=['a'])

breex['Avg Disposable Income']=df['Avg Disposable Income'].rank()+1
breex['Avg Rent']=df['Avg Rent'].rank()+1
breex['Quality of Life']=df['Quality of Life'].rank()+1
breex.index=df['City']

breex=breex.sort_values(by=['a'], ascending=True)

colors={0:'lightgreen', 1:'lightblue', 2:'yellow', 3:'black', 4: 'pink'}

    
breex.style.bar(subset=['Avg Disposable Income', 'Quality of Life'])#, color=breex['a'].map(colors))
x=breex['Avg Disposable Income'].to_numpy()
y=breex['Avg Rent'].to_numpy()
poliyu=list(breex.index)
plt.rcParams["figure.figsize"] = (12,8)
plt.axis('off')
plt.scatter(x=x,y=y, s=100,color=breex['a'].map(colors))
#plt.text(11,12,'sfd', )
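# legend: one colored patch per k-means cluster, drawn on the current axes
from matplotlib.patches import Patch
plt.legend(handles=[Patch(color=c, label='cluster '+str(k)) for k, c in colors.items()])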
for i in range(len(x)):
    plt.annotate(poliyu[i], (x[i], y[i] + 0.2))

## **Notice above**: The k-means analysis and the scatter are built from separate variables, yet the code-language clusters indicated by the colors seem aligned with the wealth gradation across these cities (x-axis = incomes, y-axis = rents). <span style="background-color: #F9F5AC"> An association between wealth, broadly speaking, and Google searches for specific coding languages is observed here across cities, globally.</span>
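
One rough way to check this visual impression is to average the income and rent ranks within each cluster. The sketch below uses invented numbers and hypothetical column names. It shows the calculation only and is not a result from the city dataset.

```python
import pandas as pd

# Invented cluster labels and income/rent values for six cities (assumption).
table = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "income":  [3200, 2900, 1500, 1700, 600, 800],
    "rent":    [2100, 1900, 900, 1100, 300, 450],
})

# Rank the wealth measures, as the scatter does, then average per cluster.
table["income_rank"] = table["income"].rank()
table["rent_rank"] = table["rent"].rank()
summary = table.groupby("cluster")[["income_rank", "rent_rank"]].mean()
print(summary)
```

Clearly separated mean ranks across clusters would support the association. Overlapping means would suggest the color pattern is weaker than it looks.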

Our (still blurry) picture is this: a small number of women scattered across a vast geography, similar though they have not met. They tend to have more connections to agricultural workers, and they use one or more of the four programming languages discussed here. For those living in cities, the city's relative wealth on a global scale may predict which coding languages they regularly use. Their countries of residence generally have lower GDP, so more poverty is in sight.

*It now befalls the reader to judge where we've ended up. I have tried to tell a Kaggle user-group's story with data in a new and interesting way. Thank you for your time and attention.*

$$D e e, Portland. 11/28/2021 $$

# References and resources

Exploratory data analysis

* https://en.wikipedia.org/wiki/Exploratory_data_analysis
* https://en.wikipedia.org/wiki/Scatter_plot

Natural Language Processing

* https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
* author: https://github.com/selva86
* https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd

One-hot encoding

* https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

Breakpoint detection

* https://towardsdatascience.com/getting-started-with-breakpoints-analysis-in-python-124471708d38

K-Means

* https://www.analyticsvidhya.com/blog/2020/12/a-detailed-introduction-to-k-means-clustering-in-python/


Datasets:

* Kaggle
* https://www.kaggle.com/c/kaggle-survey-2021/data

* World Bank
* https://www.kaggle.com/mutindafestus/world-statistics-dataset-from-world-bank

* https://data.worldbank.org/

* Google Search + Movehub
* https://www.kaggle.com/daanderson/programming-language-search-trends-by-city-2021
* https://www.kaggle.com/blitzr/movehub-city-rankings      
* https://trends.google.com/trends/?geo=US




