# OC PROJECT 5 - AUTOMATICALLY CATEGORIZE QUESTIONS
#### CLEANING AND ANALYSIS NOTEBOOK
<br></br>
### CONTENTS
- <a href="#C1">I. Data Cleaning</a>
    
- <a href="#C2">II. Feature Engineering</a>
    
- <a href="#C3">III. Data Exploration</a>
    - 1. Correlation matrix
    - 2. Time-series analysis
    - 3. Quantitative/quantitative analysis
    - 4. Quantitative/qualitative analysis
    - 5. Qualitative/qualitative analysis
    - 6. PCA

<font size="5">1. Importing libraries</font>

In [1]:
# import libraries
import os
import numpy as np
import pandas as pd
import matplotlib as mpl
from matplotlib import font_manager as rcParams
import matplotlib.patheffects as path_effects
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from scipy.stats import pearsonr
from scipy.stats import f_oneway
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

<font size="5">2. Data visualization settings</font>

In [2]:
# Add a shadow effect to text
shadow = path_effects.withSimplePatchShadow(offset=(1, -0.75),
                                            shadow_rgbFace='darkblue', alpha=0.25)

# Change the font and colors used in plots
# and increase the display resolution
plt.rcParams['font.family'] = 'Ebrima'
plt.rcParams['text.color'] = 'white'
plt.rcParams['figure.dpi'] = 200
plt.rcParams['savefig.dpi'] = 200
plt.style.use('dark_background')

# Set the seaborn theme
sns.set_style('darkgrid', {'axes.facecolor': '0.2',
                           'text.color': 'white', 'figure.figsize': (20, 16)})
plt.rcParams['figure.facecolor'] = '0.2'

# Remove the limit on the number of displayed columns
pd.set_option('display.max_columns', None)

### SQL QUERY

SELECT TOP 10000000 Title, Body, Tags, Id, Score, ViewCount, AnswerCount,
CreationDate, LastActivityDate, CommentCount

FROM Posts 

WHERE PostTypeId = 1 AND ViewCount > 100 AND Score > 3 AND AnswerCount > 0 
AND LEN(Tags) - LEN(REPLACE(Tags, '<','')) >= 5 AND CommentCount > 0
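
The `LEN(Tags) - LEN(REPLACE(Tags, '<',''))` expression counts the `<` characters in `Tags`, which is the number of tags, so the filter keeps only questions with at least five tags. A minimal Python sketch of the same idea, using a tag string from the sample data (the helper name `count_tags` is hypothetical):

```python
def count_tags(tags: str) -> int:
    """Count tags in a Stack Overflow tag string such as '<a><b>'.

    Mirrors the SQL trick: the length difference after removing every '<'
    equals the number of '<' characters, hence the number of tags.
    """
    return len(tags) - len(tags.replace('<', ''))


sample = '<javascript><jquery><ajax><jquery-ui><enter>'
n_tags = count_tags(sample)
keep_row = n_tags >= 5  # same threshold as the WHERE clause
```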

# <a name="C1">I. Data Cleaning</a>

In [11]:
df = pd.read_csv('QueryResults.csv')

In [4]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount
0,"JQuery - AJAX dialog modal, can't hit enter ke...","<p>On a website I'm working on, when you click...",<javascript><jquery><ajax><jquery-ui><enter>,1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3
1,How do I do large non-blocking updates in Post...,<p>I want to do a large update on a table in P...,<postgresql><transactions><sql-update><plpgsql...,1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4
2,Manually setting session ID in Express,<p>I have an Angular/Express app and am trying...,<javascript><angularjs><node.js><express><pass...,20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2
3,Include Google Maps API Key in open source pro...,<p>Is it okay to put your Google Maps API Key ...,<security><api><open-source><google-maps><publ...,1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1
4,Emacs ido-style shell,<p>Is there a command line shell or shell cust...,<bash><shell><emacs><eshell><ido>,1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5
...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.L...,<p>My IQueryable line is:</p>\n\n<pre><code> /...,<c#><asp.net-mvc><linq-to-sql><list><iqueryable>,2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1
49996,How can I gzip my JavaScript and CSS files?,"<p>I have a problem, I have to gzip a prototyp...",<javascript><css><apache><http><gzip>,2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1
49997,how to know location of return address on stac...,<p>i have been reading about a function that c...,<c++><c><winapi><x86><stack>,2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2
49998,Is it possible to use Firebase Cloud Messaging...,<p>I wonder if it is possible to use firebase ...,<ios><swift><firebase><apple-push-notification...,40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2


In [13]:
pd.set_option('display.max_colwidth', None)

df['Body'].head()

0    <p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you have to click on OK to submit the form, you can't just hit enter.  I need it to be able to have enter work also.  It seems like what I have should work, but it doesn't</p>\n\n<p>I'm using jquery-1.3.2.js. I also have a php file with the following piece of code in it: `</p>\n\n<pre><code>  &lt;tr valign="top" align="right" style="height:40px"&gt;&lt;td &gt;\n\n    &lt;div id="signin"&gt;\n\n      &lt;table style="margin-top:4px;margin-right:4px;border-style:solid;border-width:1px"&gt;\n\n        &lt;tr&gt;&lt;td style="width:165px;"&gt;  \n\n            &lt;div&gt;&lt;center&gt;\n\n            &lt;a title="Sign In" onclick="LoginDialogOpen()" href="javascript:void();"&gt;Sign In&lt;/a&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; | &amp;nbsp;&amp;nbsp;&lt;/b&gt;\n\n            &lt;a title="Create Account" href="CreateAccount.html"&gt;Create Account&lt;/a&gt;\n\n            &lt;/center&gt;&lt;/div&gt;

In [14]:
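# Attempt to drop <code>...</code> blocks. Without flags=re.DOTALL, '.' does not
# match newlines, so multi-line code blocks are left in place.
df['Body_nobalise'] = df['Body'].str.replace(r'<code>.*?</code>', '', regex=True)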

In [15]:
df['Body_nobalise'].head()

0    <p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you have to click on OK to submit the form, you can't just hit enter.  I need it to be able to have enter work also.  It seems like what I have should work, but it doesn't</p>\n\n<p>I'm using jquery-1.3.2.js. I also have a php file with the following piece of code in it: `</p>\n\n<pre><code>  &lt;tr valign="top" align="right" style="height:40px"&gt;&lt;td &gt;\n\n    &lt;div id="signin"&gt;\n\n      &lt;table style="margin-top:4px;margin-right:4px;border-style:solid;border-width:1px"&gt;\n\n        &lt;tr&gt;&lt;td style="width:165px;"&gt;  \n\n            &lt;div&gt;&lt;center&gt;\n\n            &lt;a title="Sign In" onclick="LoginDialogOpen()" href="javascript:void();"&gt;Sign In&lt;/a&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; | &amp;nbsp;&amp;nbsp;&lt;/b&gt;\n\n            &lt;a title="Create Account" href="CreateAccount.html"&gt;Create Account&lt;/a&gt;\n\n            &lt;/center&gt;&lt;/div&gt;
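
The output above is identical to the raw `Body`: by default `.` does not match newline characters, so `<code>.*?</code>` never spans a multi-line block. A minimal sketch of a fix with the standard `re` module, assuming the goal is to drop code blocks entirely; the sample string is made up for illustration. pandas' `Series.str.replace` accepts the same `flags` argument.

```python
import re

# Made-up sample: a paragraph followed by a multi-line code block
body = "<p>Intro text</p>\n\n<pre><code>line 1\nline 2\n</code></pre>\n\n<p>Outro</p>"

# Without DOTALL, '.' stops at newlines and nothing is removed
unchanged = re.sub(r'<code>.*?</code>', '', body)

# With DOTALL, '.' also matches '\n', so the whole block is removed
without_code = re.sub(r'<pre><code>.*?</code></pre>|<code>.*?</code>', '',
                      body, flags=re.DOTALL)
```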

In [16]:
# Cleaning function meant to strip <code> tags while keeping their content
def nettoyer_code(texte):
    # Use a regular expression to find the code tags
    pattern = r'<code>(.*?)</code>|<pre><code>(.*?)</code></pre>'
    matches = re.findall(pattern, texte)
    
    # Replace each captured group with itself and return the text.
    # Caution: the group of the alternative that did not match is '', and
    # str.replace('', code) inserts `code` at every position, so the string
    # grows explosively (the likely cause of the MemoryError below).
    for match in matches:
        code = match[0] or match[1]  # Pick the first non-empty capture group
        texte = texte.replace(match[0], code).replace(match[1], code)
    
    return texte

In [17]:
df['body_clean'] = df['Body'].apply(nettoyer_code)

MemoryError: 
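
The `MemoryError` most likely comes from `texte.replace('', code)`: whenever one capture group is empty, `str.replace` with an empty search string inserts `code` at every position, so the text balloons with each match. A sketch of a safer variant, assuming the intent is to strip the `<pre>`/`<code>` tags but keep the code they wrap; `strip_code_tags` is a hypothetical name, not the notebook's function.

```python
import re

# Matches the opening or closing <pre> and <code> tags themselves
CODE_TAG_RE = re.compile(r'</?(?:pre|code)>')


def strip_code_tags(text: str) -> str:
    """Remove <pre>/<code> tags in one pass, keeping the wrapped content."""
    return CODE_TAG_RE.sub('', text)


# Why the original loop explodes: an empty search string matches everywhere
blown_up = 'abc'.replace('', 'XY')

cleaned = strip_code_tags('<p>Use <code>print(x)</code>:</p><pre><code>a\nb</code></pre>')
```

With pandas this could be applied as `df['Body'].apply(strip_code_tags)`, which makes a single regex pass per row.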

In [5]:
df.isnull().sum()

Title               0
Body                0
Tags                0
Id                  0
Score               0
ViewCount           0
AnswerCount         0
CreationDate        0
LastActivityDate    0
CommentCount        0
dtype: int64

In [6]:
df.loc[df.duplicated(keep = False),:]

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount


In [7]:
df.dtypes

Title               object
Body                object
Tags                object
Id                   int64
Score                int64
ViewCount            int64
AnswerCount          int64
CreationDate        object
LastActivityDate    object
CommentCount         int64
dtype: object

In [8]:
df.describe()

Unnamed: 0,Id,Score,ViewCount,AnswerCount,CommentCount
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,26200700.0,28.04252,26830.93,3.39866,3.68868
std,19303420.0,194.846305,139791.1,4.026946,3.349398
min,4.0,4.0,101.0,1.0,1.0
25%,9724780.0,5.0,1945.0,1.0,1.0
50%,22996860.0,7.0,5350.0,2.0,3.0
75%,38857220.0,14.0,15471.75,4.0,5.0
max,76343880.0,25632.0,12760110.0,134.0,51.0


In [100]:
# Function to remove HTML tags (such as <p>) and newline characters
def remove_html_tags(text):
    clean = re.compile(r'<.*?>|\n')
    return re.sub(clean, '', text)

In [9]:
def preprocess_text(text):
    # Remove HTML tags
    text = re.sub('<.*?>', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Join the preprocessed tokens back into a single string
    processed_text = ' '.join(tokens)
    
    return processed_text

In [10]:
df['body_preprocess'] = df['Body'].apply(preprocess_text)

LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - 'C:\\Users\\Oliver/nltk_data'
    - 'C:\\Users\\Oliver\\anaconda3\\nltk_data'
    - 'C:\\Users\\Oliver\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\Oliver\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\Oliver\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [83]:
df['Body'].head()

0    website im working click sign jquery dialoge modal pop click ok submit form cant hit enter need able enter work also seems like work doesnt im using jquery132js also php file following piece code lttr valigntop alignright styleheight40pxgtlttd gt ltdiv idsigningt lttable stylemargintop4pxmarginright4pxborderstylesolidborderwidth1pxgt lttrgtlttd stylewidth165pxgt ltdivgtltcentergt lta titlesign onclicklogindialogopen hrefjavascriptvoidgtsign inltagtltbgtampnbspampnbsp ampnbspampnbspltbgt lta titlecreate account hrefcreateaccounthtmlgtcreate accountltagt ltcentergtltdivgt lttdgtlttrgt lttablegt ltdivgt lttdgtlttrgt ltdiv idsignin_dialog gt ltdiv idbggt ltlabelgtltspangtemailltspangtltlabelgt ltinput typetext nameemail idemail classdialoginputtextgt ltbrgt ltlabelgtltspangtpasswordltspangtltlabelgt ltinput typepassword namepassword idpassword classdialoginputtextgt ltbrgt ltbrgt ltcentergtltbgtltlabel idlogin_error stylecolorredgtltspangtampnbspltspangtltlabelgtltcentergtltbgt ltdivg
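
Tokens such as `lttr`, `gt` and `ampnbsp` in the output above are HTML entities (`&lt;`, `&gt;`, `&amp;nbsp;`) whose punctuation was stripped before they were decoded, and most of them come from the pasted code block. One possible remedy, sketched with the standard library only, is to drop code blocks and decode entities with `html.unescape` before removing punctuation. The function name and sample text are illustrative; stop-word removal and lemmatization are left out here.

```python
import html
import re


def basic_clean(text: str) -> str:
    """Drop code blocks and tags, decode entities, lowercase, strip punctuation."""
    # Remove <pre><code>...</code></pre> blocks, including multi-line ones
    text = re.sub(r'<pre><code>.*?</code></pre>', ' ', text, flags=re.DOTALL)
    # Replace the remaining HTML tags with a space so words do not merge
    text = re.sub(r'<.*?>', ' ', text)
    # Decode entities such as &lt; or &amp; into real characters
    text = html.unescape(text)
    # Lowercase and strip punctuation, then normalize whitespace
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(text.split())


sample = "<p>I can't hit enter &amp; submit</p>\n<pre><code>&lt;tr&gt;\n&lt;/tr&gt;</code></pre>"
result = basic_clean(sample)
```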

In [71]:
# Show full column contents (no limit on displayed characters)
pd.set_option('display.max_colwidth', None)

In [51]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,ActivityTime
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you hav...",[<javascript><jquery><ajax><jquery-ui><enter>],1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,444
1,How do I do large non-blocking updates in PostgreSQL?,"I want to do a large update on a table in PostgreSQL, but I don't need the transactional integri...",[<postgresql><transactions><sql-update><plpgsql><dblink>],1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5045
2,Manually setting session ID in Express,I have an Angular/Express app and am trying to implement some kind of restful auth. The Express ...,[<javascript><angularjs><node.js><express><passport.js>],20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,415
3,Include Google Maps API Key in open source project?,Is it okay to put your Google Maps API Key into your source code and publish it?Others could tak...,[<security><api><open-source><google-maps><publish>],1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,0
4,Emacs ido-style shell,Is there a command line shell or shell customization that supports emacs-style ido find file? I...,[<bash><shell><emacs><eshell><ido>],1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,3839
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,My IQueryable line is: // find all timesheets for this period - from db so System.Data.Linq.Data...,[<c#><asp.net-mvc><linq-to-sql><list><iqueryable>],2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,0
49996,How can I gzip my JavaScript and CSS files?,"I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, wher...",[<javascript><css><apache><http><gzip>],2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,1994
49997,how to know location of return address on stack c/c++,i have been reading about a function that can overwrite its return address.void foo(const char* ...,[<c++><c><winapi><x86><stack>],2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,0
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,I wonder if it is possible to use firebase cloud messaging with iOS app without Apple Developer ...,[<ios><swift><firebase><apple-push-notifications><firebase-cloud-messaging>],40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,68


# <a name="C2">II. Feature Engineering</a>

In [64]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,nb_of_tags
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you hav...",<javascript><jquery><ajax><jquery-ui><enter>,1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,5
1,How do I do large non-blocking updates in PostgreSQL?,"I want to do a large update on a table in PostgreSQL, but I don't need the transactional integri...",<postgresql><transactions><sql-update><plpgsql><dblink>,1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5
2,Manually setting session ID in Express,I have an Angular/Express app and am trying to implement some kind of restful auth. The Express ...,<javascript><angularjs><node.js><express><passport.js>,20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,5
3,Include Google Maps API Key in open source project?,Is it okay to put your Google Maps API Key into your source code and publish it?Others could tak...,<security><api><open-source><google-maps><publish>,1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,5
4,Emacs ido-style shell,Is there a command line shell or shell customization that supports emacs-style ido find file? I...,<bash><shell><emacs><eshell><ido>,1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,5
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,My IQueryable line is: // find all timesheets for this period - from db so System.Data.Linq.Data...,<c#><asp.net-mvc><linq-to-sql><list><iqueryable>,2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,5
49996,How can I gzip my JavaScript and CSS files?,"I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, wher...",<javascript><css><apache><http><gzip>,2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,5
49997,how to know location of return address on stack c/c++,i have been reading about a function that can overwrite its return address.void foo(const char* ...,<c++><c><winapi><x86><stack>,2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,5
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,I wonder if it is possible to use firebase cloud messaging with iOS app without Apple Developer ...,<ios><swift><firebase><apple-push-notifications><firebase-cloud-messaging>,40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,5
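
The cell that created `nb_of_tags` is not shown. One plausible way to derive it, assuming `Tags` still holds the raw `<tag1><tag2>` string at that point, is to count the regex matches; with pandas this could be applied through `df['Tags'].apply(...)`. The helper name below is hypothetical.

```python
import re


def count_tags_in(tag_string: str) -> int:
    """Return the number of <tag> items in a raw tag string."""
    return len(re.findall(r'<(.*?)>', tag_string))


nb_of_tags = count_tags_in('<bash><shell><emacs><eshell><ido>')
```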


In [104]:
# Convert 'CreationDate' and 'LastActivityDate' to datetime
df['CreationDate'] = pd.to_datetime(df['CreationDate'])
df['LastActivityDate'] = pd.to_datetime(df['LastActivityDate'])

# Compute the difference in whole days between the two columns
df['ActivityTime'] = (df['LastActivityDate'] - df['CreationDate']).dt.days
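
`.dt.days` keeps only the whole-day component of each timedelta, so a question whose last activity falls a few hours after its creation gets `ActivityTime = 0`. The same arithmetic with the standard `datetime` module, using two timestamp pairs from the sample rows (the function name `activity_days` is hypothetical):

```python
from datetime import datetime


def activity_days(created: str, last_activity: str) -> int:
    """Whole days elapsed between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    delta = datetime.fromisoformat(last_activity) - datetime.fromisoformat(created)
    return delta.days


long_lived = activity_days('2009-07-11 08:03:25', '2010-09-28 09:27:26')
same_day = activity_days('2009-07-11 09:01:50', '2009-07-11 13:19:49')
```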

In [49]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,ActivityTime
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you hav...",<javascript><jquery><ajax><jquery-ui><enter>,1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,444
1,How do I do large non-blocking updates in PostgreSQL?,"I want to do a large update on a table in PostgreSQL, but I don't need the transactional integri...",<postgresql><transactions><sql-update><plpgsql><dblink>,1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5045
2,Manually setting session ID in Express,I have an Angular/Express app and am trying to implement some kind of restful auth. The Express ...,<javascript><angularjs><node.js><express><passport.js>,20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,415
3,Include Google Maps API Key in open source project?,Is it okay to put your Google Maps API Key into your source code and publish it?Others could tak...,<security><api><open-source><google-maps><publish>,1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,0
4,Emacs ido-style shell,Is there a command line shell or shell customization that supports emacs-style ido find file? I...,<bash><shell><emacs><eshell><ido>,1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,3839
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,My IQueryable line is: // find all timesheets for this period - from db so System.Data.Linq.Data...,<c#><asp.net-mvc><linq-to-sql><list><iqueryable>,2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,0
49996,How can I gzip my JavaScript and CSS files?,"I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, wher...",<javascript><css><apache><http><gzip>,2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,1994
49997,how to know location of return address on stack c/c++,i have been reading about a function that can overwrite its return address.void foo(const char* ...,<c++><c><winapi><x86><stack>,2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,0
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,I wonder if it is possible to use firebase cloud messaging with iOS app without Apple Developer ...,<ios><swift><firebase><apple-push-notifications><firebase-cloud-messaging>,40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,68


# <a name="C3">III. Feature Extraction</a>

In [18]:
# Function to extract the tags from a string
def extract_tags(tag_string):
    tags = re.findall(r'<(.*?)>', tag_string)
    return tags

# Apply the tag extraction function to the 'Tags' column of the DataFrame
df['Tags'] = df['Tags'].apply(extract_tags)

# Collect the unique tags across all rows
unique_tags = {tag for tags_list in df['Tags'] for tag in tags_list}

# Display the unique tags
print("Unique tags:")
for tag in unique_tags:
    print(tag)

Unique tags:
matching
lme4
saxon-js
stdmutex
db-schema
title
cldc
specification-pattern
multibinding
mule-module-jpa
modbus
svelte
conditional-expressions
scribd
mit-scratch
python-multithreading
avaudioplayer
hot-reload
simplejson
ruby-on-rails
ruby-on-rails-5.1
scala-option
decimal-point
htc-android
dockerhub
currentculture
nserror
avatar
gost3410
audio-fingerprinting
hashcode
filelist
plpython
strong-parameters
postsharp
serviceconnection
countdownlatch
eager
bytearrayinputstream
metatable
breeze
delorian
opam
sql-server
puppet
indy10
twitter-streaming-api
ios8
pylint
dashboard
setwindowshookex
iasyncresult
selenium-remotedriver
joomla1.6
system.speech.recognition
nav
blending
brms
relational-database
typehandler
stm32f4discovery
apache-velocity
asf
foaf
ext3
mouse
link-grammar
machine.config
gs1-ai-syntax
animatewithduration
typeinfo
html5-video
wrapper
android-build-flavors
sysdate
domain-model
richedit
gwidgets
callcontext
rstudio
ruby
datamodel
zbar
mediafire
vagrant-windows
t

In [19]:
list_tags = [tag for tags_list in df['Tags'] for tag in tags_list]

In [25]:
len(list_tags)

250000

In [26]:
from collections import Counter

In [27]:
counter = Counter(list_tags)

In [30]:
counter.most_common(10)

[('c#', 6487),
 ('java', 5969),
 ('javascript', 5285),
 ('python', 4648),
 ('c++', 4587),
 ('android', 3358),
 ('ios', 3338),
 ('.net', 2956),
 ('html', 2397),
 ('php', 2196)]

In [31]:
list_tag_common = ['c#', 'java', 'javascript', 'python', 'c++', 'android', 'ios', '.net', 'html', 'php']
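
`Counter.most_common(n)` returns `(tag, count)` pairs sorted by decreasing count, so the hard-coded list above can also be derived from the counter rather than typed by hand. A small sketch on made-up tag lists:

```python
from collections import Counter

# Made-up tag lists standing in for df['Tags']
tag_lists = [['python', 'pandas'], ['python', 'numpy'], ['java'], ['python', 'java']]

# Flatten, count, and keep the names of the two most frequent tags
counts = Counter(tag for tags in tag_lists for tag in tags)
top_tags = [tag for tag, _ in counts.most_common(2)]
```

In the notebook, the equivalent would be `[tag for tag, _ in counter.most_common(10)]`.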

In [33]:
pd.set_option('display.max_colwidth', 100)

df.head()

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,Body_nobalise
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ...","[javascript, jquery, ajax, jquery-ui, enter]",1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,"<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ..."
1,How do I do large non-blocking updates in PostgreSQL?,"<p>I want to do a large update on a table in PostgreSQL, but I don't need the transactional inte...","[postgresql, transactions, sql-update, plpgsql, dblink]",1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,"<p>I want to do a large update on a table in PostgreSQL, but I don't need the transactional inte..."
2,Manually setting session ID in Express,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...,"[javascript, angularjs, node.js, express, passport.js]",20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...
3,Include Google Maps API Key in open source project?,<p>Is it okay to put your Google Maps API Key into your source code and publish it?</p>\n\n<p>Ot...,"[security, api, open-source, google-maps, publish]",1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,<p>Is it okay to put your Google Maps API Key into your source code and publish it?</p>\n\n<p>Ot...
4,Emacs ido-style shell,<p>Is there a command line shell or shell customization that supports emacs-style ido find file?...,"[bash, shell, emacs, eshell, ido]",1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,<p>Is there a command line shell or shell customization that supports emacs-style ido find file?...


In [34]:
list_tags_test = ['python', 'green', 'bonjour']

In [42]:
list_tags_test2 = ['green', 'bonjour']

In [36]:
for tag in list_tags_test:
    if tag in list_tag_common:
        print(tag)
    else:
        print('not found')


python
not found
not found


In [44]:
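# Return the first tag of a question that appears in list_tag_common, or '' if none does.
# Note: this redefines extract_tags from the earlier cell.
def extract_tags(list_to_check):
    for tags in list_to_check:
        if tags in list_tag_common:
            return tags
    return ''

In [45]:
extract_tags(list_tags_test2)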

''

In [47]:
df['main_tag'] = df['Tags'].apply(extract_tags)

In [49]:
df['main_tag'].head(49)

0     javascript
1               
2     javascript
3               
4               
5               
6               
7            c++
8     javascript
9               
10            c#
11        python
12    javascript
13    javascript
14            c#
15            c#
16    javascript
17              
18    javascript
19              
20              
21            c#
22            c#
23              
24        python
25              
26            c#
27              
28            c#
29           c++
30    javascript
31            c#
32          .net
33              
34              
35       android
36          java
37              
38            c#
39              
40    javascript
41           c++
42              
43           php
44            c#
45              
46           ios
47           php
48              
Name: main_tag, dtype: object

In [52]:
df_main_tag = df[df['main_tag'] != '']

In [54]:
df_main_tag

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,Body_nobalise,main_tag
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ...","[javascript, jquery, ajax, jquery-ui, enter]",1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,"<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ...",javascript
2,Manually setting session ID in Express,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...,"[javascript, angularjs, node.js, express, passport.js]",20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...,javascript
7,"C++ RTTI in a Windows 64-bit VectoredExceptionHandler, MS Visual Studio 2015",<p>I'm working on small Windows Exception handling engine trying to gather maximum information f...,"[c++, visual-studio, exception, x86-64, rtti]",39113168,8,2230,1,2016-08-24 01:44:29,2016-08-26 02:19:16,2,<p>I'm working on small Windows Exception handling engine trying to gather maximum information f...,c++
8,How to prevent the keyboard from popping up on mobile devices?,"<p><a href=""http://api.jqueryui.com/spinner/"" rel=""noreferrer"">http://api.jqueryui.com/spinner/<...","[javascript, jquery, html, css, mobile]",39113558,5,23143,4,2016-08-24 02:39:08,2018-10-05 12:15:51,4,"<p><a href=""http://api.jqueryui.com/spinner/"" rel=""noreferrer"">http://api.jqueryui.com/spinner/<...",javascript
10,How do you do dependency injection with AutoFac and OWIN?,<p>This is for MVC5 and the new pipeline. I cannot find a good example anywhere.</p>\n\n<pre><c...,"[c#, dependency-injection, asp.net-mvc-5, owin, autofac]",20061082,15,18422,1,2013-11-19 00:52:43,2016-05-17 13:10:51,3,<p>This is for MVC5 and the new pipeline. I cannot find a good example anywhere.</p>\n\n<pre><c...,c#
...,...,...,...,...,...,...,...,...,...,...,...,...
49993,Lazy-loading visible items in a Listview,"<p>I have a listview which uses the following code:</p>\n\n<pre><code>&lt;ListView x:Name=""Displ...","[c#, wpf, xaml, data-binding, lazy-loading]",21319143,5,10079,1,2014-01-23 20:57:18,2021-04-27 06:24:54,9,"<p>I have a listview which uses the following code:</p>\n\n<pre><code>&lt;ListView x:Name=""Displ...",c#
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,<p>My IQueryable line is:</p>\n\n<pre><code> // find all timesheets for this period - from db so...,"[c#, asp.net-mvc, linq-to-sql, list, iqueryable]",2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,<p>My IQueryable line is:</p>\n\n<pre><code> // find all timesheets for this period - from db so...,c#
49996,How can I gzip my JavaScript and CSS files?,"<p>I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, w...","[javascript, css, apache, http, gzip]",2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,"<p>I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, w...",javascript
49997,how to know location of return address on stack c/c++,<p>i have been reading about a function that can overwrite its return address.</p>\n\n<pre><code...,"[c++, c, winapi, x86, stack]",2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,<p>i have been reading about a function that can overwrite its return address.</p>\n\n<pre><code...,c++


In [109]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,ActivityTime
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form",website im working click sign jquery dialoge modal pop click ok submit form cant hit enter need able enter work also seems like work doesnt im using jquery132js also php file following piece code lttr valigntop alignright styleheight40pxgtlttd gt ltdiv idsigningt lttable stylemargintop4pxmarginright4pxborderstylesolidborderwidth1pxgt lttrgtlttd stylewidth165pxgt ltdivgtltcentergt lta titlesign onclicklogindialogopen hrefjavascriptvoidgtsign inltagtltbgtampnbspampnbsp ampnbspampnbspltbgt lta titlecreate account hrefcreateaccounthtmlgtcreate accountltagt ltcentergtltdivgt lttdgtlttrgt lttablegt ltdivgt lttdgtlttrgt ltdiv idsignin_dialog gt ltdiv idbggt ltlabelgtltspangtemailltspangtltlabelgt ltinput typetext nameemail idemail classdialoginputtextgt ltbrgt ltlabelgtltspangtpasswordltspangtltlabelgt ltinput typepassword namepassword idpassword classdialoginputtextgt ltbrgt ltbrgt ltcentergtltbgtltlabel idlogin_error stylecolorredgtltspangtampnbspltspangtltlabelgtltcentergtltbgt ltdivgt ltdivgt ltscriptgt login_dialogdialog autoopen false width 310 overlay opacity 05 background black modal true button ok function bodyaddclasscurwait sql select client_id user email email0value login_password password0value getbongodataphp tasksqlresulttojson sql sql resultofloginattempt json cancel function thisdialogclose ltscriptgt javascript file following function function logindialogopen login_dialogdialogopen login_dialogkeypressfunctione ewhich 13 bodyaddclasscurwait sql select client_id user email email0value login_password password0value getbongodataphp tasksqlresulttojson sql sql resultofloginattempt json code dont understand isnt working also try login_dialogdialogisopen right opened always returned false oddly enough please help,"[javascript, jquery, ajax, jquery-ui, enter]",1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,444
1,How do I do large non-blocking updates in PostgreSQL?,want large update table postgresql dont need transactional integrity maintained across entire operation know column im changing going written read update want know easy way psql console make type operation faster example let say table called order 35 million row want update order set status null avoid diverted offtopic discussion let assume value status 35 million column currently set nonnull value thus rendering index useless problem statement take long time go effect solely locking changed row locked entire update complete update might take 5 hour whereas something like update order set status null order_id gt 0 order_id lt 1000000 might take 1 minute 35 million row breaking chunk 35 would take 35 minute save 4 hour 25 minute could break even script using pseudocode 0 3500 db_operation update order set status null order_id gt i1000 order_id lt i11000 operation might complete minute rather 35 come im really asking dont want write freaking script break operation every single time want big onetime update like way accomplish want entirely within sql,"[postgresql, transactions, sql-update, plpgsql, dblink]",1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5045
2,Manually setting session ID in Express,angularexpress app trying implement kind restful auth express app passport standard usernamepass login redis session successful login return session id angular sends every request header problem dont know make express make use session id tried writing reqsessionid middleware success use header query string way send session id along,"[javascript, angularjs, node.js, express, passport.js]",20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,415
3,Include Google Maps API Key in open source project?,okay put google map api key source code publish others could take misuse dont want every developer user get api key type somewhere owner key responsible create new google account project project desktop application objectivec small developer tool would best way make convenient,"[security, api, open-source, google-maps, publish]",1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,0
4,Emacs ido-style shell,command line shell shell customization support emacsstyle ido find file emacs navigate directory extremely quickly using cx cf idomode ideally im looking solution used outside emacs though id open way quickly change directory within eshell buffer,"[bash, shell, emacs, eshell, ido]",1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,3839
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collection.Generic.List) object?,iqueryable line find timesheets period db systemdatalinqdataquery var timesheets _timesheetrepositoryfindbyperioddte1 dte2 list line get team ad active directory systemcollectiongenericlist var adusers _aduserrepositorygetmyteamuseridentityname wish show timesheets user timesheet collection present user collection use standard c expression var teamsheets timesheets join user adusers tuser1username equal userfullname select get error iqueryable return selfreferencing constant expression supported recommendation,"[c#, asp.net-mvc, linq-to-sql, list, iqueryable]",2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,0
49996,How can I gzip my JavaScript and CSS files?,problem gzip prototype lib totaly idea start work find tutorial wasnt helpful folder j file compressedjs 1js 2js 3js im calling file test file compressesindexphp ltlink reljavascript typetextjs hrefjstabsjs gt ltlink reljavascript typetextjs hrefjsfbjs gt,"[javascript, css, apache, http, gzip]",2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,1994
49997,how to know location of return address on stack c/c++,reading function overwrite return address void fooconst char input char buf10 extra argument supplied printf cheap trick view stack 8 well see trick look format string printfmy stack look likenpnpnpnpnpn pnn p ie expect pointer pas user input straight secure code public enemy 1 strcpybuf input printfsn buf printfnow stack look likenpnpnpnpnpnpnn sugggested stack would look like address foo 00401000 stack look like 00000000 00000000 7ffdf000 0012ff80 0040108a lt want overwrite return address foo 00410ede question author arbitrarily choose second last value return address foo value added stack bottom top apart function return address value apparently see stack ie isnt filled zero thanks,"[c++, c, winapi, x86, stack]",2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,0
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,wonder possible use firebase cloud messaging io app without apple developer program instance asking whether set certificate apple push notification havent found much information web,"[ios, swift, firebase, apple-push-notifications, firebase-cloud-messaging]",40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,68


In [105]:
df.shape

(50000, 11)

In [106]:
# Initialize the word-count vectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the question bodies and build the document-term matrix
X = vectorizer.fit_transform(df['Body'])

# Get the list of words (features)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = vectorizer.get_feature_names_out().tolist()

# Display the feature matrix (sparse format: (row, column) count)
print("Feature matrix:")
print(X)

# Display the list of words (features)
# Note: printing the full vocabulary can exceed Jupyter's IOPub data rate limit
print("List of words (features):")
print(features)



Feature matrix:
  (0, 679278)	1
  (0, 315440)	2
  (0, 685884)	2
  (0, 147713)	2
  (0, 571309)	1
  (0, 342730)	1
  (0, 205513)	1
  (0, 407212)	2
  (0, 486898)	1
  (0, 446112)	2
  (0, 600211)	1
  (0, 254424)	1
  (0, 132309)	1
  (0, 286158)	1
  (0, 226708)	2
  (0, 428084)	1
  (0, 67884)	1
  (0, 685559)	2
  (0, 79689)	3
  (0, 552052)	1
  (0, 359142)	1
  (0, 212966)	1
  (0, 659322)	1
  (0, 342733)	1
  (0, 481001)	1
  :	:
  (49998, 502002)	1
  (49998, 282480)	1
  (49998, 436506)	1
  (49998, 413850)	1
  (49998, 329523)	1
  (49998, 94909)	2
  (49998, 685179)	1
  (49998, 401305)	1
  (49998, 136701)	1
  (49998, 248269)	1
  (49998, 149742)	1
  (49999, 450108)	1
  (49999, 338469)	1
  (49999, 636936)	1
  (49999, 357622)	1
  (49999, 153477)	1
  (49999, 180651)	1
  (49999, 90726)	1
  (49999, 546187)	1
  (49999, 90640)	3
  (49999, 521712)	1
  (49999, 603908)	2
  (49999, 187837)	1
  (49999, 264665)	1
  (49999, 271172)	1
List of words (features):


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

