# OC PROJET 5 - AUTOMATICALLY CATEGORIZE QUESTIONS
#### CLEANING AND ANALYSIS NOTEBOOK
<br></br>
### SOMMAIRE
- <a href="#C1">I. Nettoyage des données</a>
    
- <a href="#C2">II. Feature Engineering</a>
    
- <a href="#C3">III. Exploration des données</a>
    - 1. Matrice des corrélations
    - 2. Analyse temporelle
    - 3. Analyse Quanti/Quanti
    - 4. Analyse Quanti/Quali
    - 5. Analyse Quali/Quali
    - 6. ACP

<font size="5">1. Importation des librairies</font>

In [1]:
# importation des librairies
import os
import numpy as np
import pandas as pd
import matplotlib as mpl
from matplotlib import font_manager as rcParams
import matplotlib.patheffects as path_effects
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from scipy.stats import pearsonr
from scipy.stats import f_oneway
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

<font size="5">2. Paramétrages Data Visualisation</font>

In [2]:
# Ajouter une ombre à la police
shadow = path_effects.withSimplePatchShadow(offset = (1, - 0.75), 
shadow_rgbFace = 'darkblue', alpha = 0.25)

# changer la police dans les graphiques, les couleurs 
# et augmenter la résolution d'affichage
plt.rcParams['font.family'] = 'Ebrima'
plt.rcParams['text.color'] = 'white'
plt.rcParams['figure.dpi'] = 200
plt.rcParams['savefig.dpi'] = 200
plt.style.use('dark_background')

# set le theme seaborn
sns.set_style('darkgrid', {'axes.facecolor': '0.2',
'text.color': 'white', 'figure.figsize': (20, 16)})
plt.rcParams['figure.facecolor'] = '0.2'

# suppression de l'affichage max des colonnes
pd.set_option('display.max_columns', None)

### SQL REQUEST CODE 

SELECT TOP 10000000 Title, Body, Tags, Id, Score, ViewCount, AnswerCount,
CreationDate, LastActivityDate, CommentCount

FROM Posts 

WHERE PostTypeId = 1 AND ViewCount > 100 AND Score > 3 AND AnswerCount > 0 
AND LEN(Tags) - LEN(REPLACE(Tags, '<','')) >= 5 AND CommentCount > 0

# <a name="C2">I. Nettoyage des données</a>

In [15]:
df = pd.read_csv('QueryResults.csv')

In [9]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you have to click on OK to submit the form, you can't just hit enter. I need it to be able to have enter work also. It seems like what I have should work, but it doesn't</p>\n\n<p>I'm using jquery-1.3.2.js. I also have a php file with the following piece of code in it: `</p>\n\n<pre><code> &lt;tr valign=""top"" align=""right"" style=""height:40px""&gt;&lt;td &gt;\n\n &lt;div id=""signin""&gt;\n\n &lt;table style=""margin-top:4px;margin-right:4px;border-style:solid;border-width:1px""&gt;\n\n &lt;tr&gt;&lt;td style=""width:165px;""&gt; \n\n &lt;div&gt;&lt;center&gt;\n\n &lt;a title=""Sign In"" onclick=""LoginDialogOpen()"" href=""javascript:void();""&gt;Sign In&lt;/a&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; | &amp;nbsp;&amp;nbsp;&lt;/b&gt;\n\n &lt;a title=""Create Account"" href=""CreateAccount.html""&gt;Create Account&lt;/a&gt;\n\n &lt;/center&gt;&lt;/div&gt; \n\n &lt;/td&gt;&lt;/tr&gt;\n\n &lt;/table&gt;\n\n &lt;/div&gt;\n\n &lt;/td&gt;&lt;/tr&gt;\n</code></pre>\n\n<p></p>\n\n<pre><code> &lt;div id=""Signin_Dialog"" &gt;\n\n &lt;div id=""bg""&gt;\n\n &lt;label&gt;&lt;span&gt;Email:&lt;/span&gt;&lt;/label&gt;\n\n &lt;input type=""text"" name=""email"" id=""email"" class=""dialog-input-text""/&gt;\n\n &lt;br&gt;\n\n\n\n &lt;label&gt;&lt;span&gt;Password:&lt;/span&gt;&lt;/label&gt;\n\n &lt;input type=""password"" name=""password"" id=""password"" class=""dialog-input-text""/&gt;\n\n &lt;br&gt;\n\n &lt;br&gt;\n\n &lt;center&gt;&lt;b&gt;&lt;label id=""login_error"" style=""color:red""&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/label&gt;&lt;/center&gt;&lt;/b&gt;\n\n\n\n &lt;/div&gt;\n\n&lt;/div&gt;\n\n\n\n&lt;script&gt;\n\n $('#login_dialog').dialog({\n\n autoOpen: false,\n\n width: 310,\n\n overlay: { opacity: 0.5, background: ""black"" },\n\n modal: true,\n\n buttons: {\n\n ""Ok"": function() { \n\n $(""body"").addClass(""curWait""); \n\n sql = ""select client_id from users where email = '"" + $(""#email"")[0].value + ""' and login_password='"" + $(""#password"")[0].value + ""'"";\n\n $.get('BongoData.php', { task:""SQLResultToJSON"", sql: sql}, ResultOfLoginAttempt, ""json"");\n\n }, \n\n ""Cancel"": function() { \n\n $(this).dialog(""close""); \n\n } \n\n }\n\n });\n\n\n&lt;/script&gt;`\n</code></pre>\n\n<p>i have a javascript file with the following function:</p>\n\n<pre><code>function LoginDialogOpen(){\n\n $('#login_dialog').dialog('open');\n $('#login_dialog').keypress(function(e) {\n if (e.which == 13) {\n $(""body"").addClass(""curWait""); \n\n sql = ""select client_id from users where email = '"" + $(""#email"")[0].value + ""' and login_password='"" + $(""#password"")[0].value + ""'"";\n\n $.get('BongoData.php', { task:""SQLResultToJSON"", sql: sql}, ResultOfLoginAttempt, ""json"");\n }\n});\n</code></pre>\n\n<p>}</p>\n\n<p>That is the code I have, I don't understand why it isn't working. </p>\n\n<p>I also had it try $('#login_dialog').dialog('isOpen'); right after i opened it, but it always returned false oddly enough. Please help if you can.</p>\n",<javascript><jquery><ajax><jquery-ui><enter>,1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3
1,How do I do large non-blocking updates in PostgreSQL?,"<p>I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way <em>in the psql console</em> to make these types of operations faster. </p>\n\n<p>For example, let's say I have a table called ""orders"" with 35 million rows, and I want to do this: </p>\n\n<pre><code>UPDATE orders SET status = null;\n</code></pre>\n\n<p>To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million columns are currently set to the same (non-null) value, thus rendering an index useless.</p>\n\n<p>The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like </p>\n\n<pre><code>UPDATE orders SET status = null WHERE (order_id &gt; 0 and order_id &lt; 1000000);\n</code></pre>\n\n<p>might take 1 minute. Over 35 million rows, doing the above and breaking it into chunks of 35 would only take 35 minutes and save me 4 hours and 25 minutes.</p>\n\n<p>I could break it down even further with a script (using pseudocode here):</p>\n\n<pre><code>for (i = 0 to 3500) {\n db_operation (""UPDATE orders SET status = null\n WHERE (order_id &gt;"" + (i*1000)""\n + "" AND order_id &lt;"" + ((i+1)*1000) "" + "")"");\n}\n</code></pre>\n\n<p>This operation might complete in only a few minutes, rather than 35. </p>\n\n<p>So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?</p>\n",<postgresql><transactions><sql-update><plpgsql><dblink>,1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4
2,Manually setting session ID in Express,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Express app has Passport standard username/pass login and Redis sessions. On successful login I return the session ID and angular then sends this to every request in the headers. The problem is that I don't know how to make Express make use of this as the session ID. I have tried writing to req.sessionId in middleware with no success. </p>\n\n<p>How can I use headers or query string as a way to send the session id along. </p>\n,<javascript><angularjs><node.js><express><passport.js>,20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2
3,Include Google Maps API Key in open source project?,"<p>Is it okay to put your Google Maps API Key into your source code and publish it?</p>\n\n<p>Others could take it and misuse it, but I don't want every developer / user to get their own API key and type it in somewhere. If the owner of the key is responsible, should I create a new google account for the project? (The project is a desktop application in Objective-C and a small developer tool.)</p>\n\n<p>What would be the best way to make this convenient?</p>\n",<security><api><open-source><google-maps><publish>,1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1
4,Emacs ido-style shell,"<p>Is there a command line shell or shell customization that supports emacs-style ido find file? In emacs, I can navigate to a directory extremely quickly using <code>C-x C-f</code> and <code>(ido-mode t)</code>. </p>\n\n<p>Ideally, I'm looking for a solution that can be used outside of emacs. Though I'd be open for a way to quickly change directories within an eshell buffer.</p>\n",<bash><shell><emacs><eshell><ido>,1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5
...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collection.Generic.List) object?,"<p>My IQueryable line is:</p>\n\n<pre><code> // find all timesheets for this period - from db so System.Data.Linq.DataQuery\n var timesheets = _timesheetRepository.FindByPeriod(dte1, dte2);\n</code></pre>\n\n<p>My List line is:</p>\n\n<pre><code> // get my team from AD - from active directory so System.Collection.Generic.List\n var adUsers = _adUserRepository.GetMyTeam(User.Identity.Name);\n</code></pre>\n\n<p>I wish to only show timesheets for those users in the timesheet collection that are present in the user collection.</p>\n\n<p>If I use a standard c# expression such as:</p>\n\n<pre><code> var teamsheets = from t in timesheets\n join user in adUsers on t.User1.username equals user.fullname\n select t;\n</code></pre>\n\n<p>I get the error ""An IQueryable that returns a self-referencing Constant expression is not supported""</p>\n\n<p>Any recommendations?</p>\n",<c#><asp.net-mvc><linq-to-sql><list><iqueryable>,2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1
49996,How can I gzip my JavaScript and CSS files?,"<p>I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, where to start and how does it works. :)</p>\n\n<p>I find some tutorials but that wasn't helpful...</p>\n\n<p>So I have a folder with my JS Files:</p>\n\n<p>/compressed/js/\n1.js\n2.js\n3.js</p>\n\n<p>I'm calling these files for a test in this file</p>\n\n<p>/compresses/index.php</p>\n\n<pre><code>&lt;link rel=""javascript"" type=""text/js"" href=""js/tabs.js"" /&gt;\n&lt;link rel=""javascript"" type=""text/js"" href=""js/fb.js"" /&gt;\n</code></pre>\n\n<p>So what do I have to do? :)</p>\n",<javascript><css><apache><http><gzip>,2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1
49997,how to know location of return address on stack c/c++,"<p>i have been reading about a function that can overwrite its return address.</p>\n\n<pre><code>void foo(const char* input)\n{\n char buf[10];\n\n //What? No extra arguments supplied to printf?\n //It's a cheap trick to view the stack 8-)\n //We'll see this trick again when we look at format strings.\n printf(""My stack looks like:\n%p\n%p\n%p\n%p\n%p\n% p\n\n""); //%p ie expect pointers\n\n //Pass the user input straight to secure code public enemy #1.\n strcpy(buf, input);\n printf(""%s\n"", buf);\n\n printf(""Now the stack looks like:\n%p\n%p\n%p\n%p\n%p\n%p\n\n"");\n} \n</code></pre>\n\n<p>It was sugggested that this is how the stack would look like </p>\n\n<p>Address of foo = <strong>00401000</strong> </p>\n\n<p><strong>My stack looks like:</strong><br>\n00000000<br>\n00000000<br>\n7FFDF000<br>\n0012FF80<br>\n<strong>0040108A &lt;-- We want to overwrite the return address for foo.</strong><br>\n00410EDE </p>\n\n<p><strong>Question:</strong><br>\n-. Why did the author arbitrarily choose the second last value as the return address of foo()?</p>\n\n<p>-. Are values added to the stack from the bottom or from the top?</p>\n\n<ul>\n<li>apart from the function return address, what are the other values i apparently see on the stack? ie why isn't it filled with zeros </li>\n</ul>\n\n<p>Thanks.</p>\n",<c++><c><winapi><x86><stack>,2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,"<p>I wonder if it is possible to use firebase cloud messaging with iOS app without Apple Developer Program? </p>\n\n<p><strong><em>For instance, I am asking whether I can set up certificates for Apple Push Notification?</em></strong> </p>\n\n<p>I haven't found much information on the web.</p>\n",<ios><swift><firebase><apple-push-notifications><firebase-cloud-messaging>,40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2


In [10]:
pd.set_option('display.max_colwidth', None)

df['Body'].head()

0    <p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you have to click on OK to submit the form, you can't just hit enter.  I need it to be able to have enter work also.  It seems like what I have should work, but it doesn't</p>\n\n<p>I'm using jquery-1.3.2.js. I also have a php file with the following piece of code in it: `</p>\n\n<pre><code>  &lt;tr valign="top" align="right" style="height:40px"&gt;&lt;td &gt;\n\n    &lt;div id="signin"&gt;\n\n      &lt;table style="margin-top:4px;margin-right:4px;border-style:solid;border-width:1px"&gt;\n\n        &lt;tr&gt;&lt;td style="width:165px;"&gt;  \n\n            &lt;div&gt;&lt;center&gt;\n\n            &lt;a title="Sign In" onclick="LoginDialogOpen()" href="javascript:void();"&gt;Sign In&lt;/a&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; | &amp;nbsp;&amp;nbsp;&lt;/b&gt;\n\n            &lt;a title="Create Account" href="CreateAccount.html"&gt;Create Account&lt;/a&gt;\n\n            &lt;/center&gt;&lt;/div&gt;

In [26]:
df['Body_nobalise'] = df['Body'].str.replace(r'<code>.*?</code>', '', regex=True)

In [27]:
df['Body_nobalise'].head()

0    <p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you have to click on OK to submit the form, you can't just hit enter.  I need it to be able to have enter work also.  It seems like what I have should work, but it doesn't</p>\n\n<p>I'm using jquery-1.3.2.js. I also have a php file with the following piece of code in it: `</p>\n\n<pre><code>  &lt;tr valign="top" align="right" style="height:40px"&gt;&lt;td &gt;\n\n    &lt;div id="signin"&gt;\n\n      &lt;table style="margin-top:4px;margin-right:4px;border-style:solid;border-width:1px"&gt;\n\n        &lt;tr&gt;&lt;td style="width:165px;"&gt;  \n\n            &lt;div&gt;&lt;center&gt;\n\n            &lt;a title="Sign In" onclick="LoginDialogOpen()" href="javascript:void();"&gt;Sign In&lt;/a&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; | &amp;nbsp;&amp;nbsp;&lt;/b&gt;\n\n            &lt;a title="Create Account" href="CreateAccount.html"&gt;Create Account&lt;/a&gt;\n\n            &lt;/center&gt;&lt;/div&gt;

In [28]:
# Fonction de nettoyage pour enlever les balises de code
def nettoyer_code(texte):
    # Utiliser une expression régulière pour trouver les balises de code
    pattern = r'<code>(.*?)</code>|<pre><code>(.*?)</code></pre>'
    matches = re.findall(pattern, texte)
    
    # Supprimer les balises de code et retourner le texte nettoyé
    for match in matches:
        code = match[0] or match[1]  # Sélectionner le premier groupe de capture non vide
        texte = texte.replace(match[0], code).replace(match[1], code)
    
    return texte

In [29]:
df['body_clean'] = df['Body'].apply(nettoyer_code)

MemoryError: 

In [5]:
df.isnull().sum()

Title               0
Body                0
Tags                0
Id                  0
Score               0
ViewCount           0
AnswerCount         0
CreationDate        0
LastActivityDate    0
CommentCount        0
dtype: int64

In [6]:
df.loc[df.duplicated(keep = False),:]

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount


In [7]:
df.dtypes

Title               object
Body                object
Tags                object
Id                   int64
Score                int64
ViewCount            int64
AnswerCount          int64
CreationDate        object
LastActivityDate    object
CommentCount         int64
dtype: object

In [8]:
df.describe()

Unnamed: 0,Id,Score,ViewCount,AnswerCount,CommentCount
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,26200700.0,28.04252,26830.93,3.39866,3.68868
std,19303420.0,194.846305,139791.1,4.026946,3.349398
min,4.0,4.0,101.0,1.0,1.0
25%,9724780.0,5.0,1945.0,1.0,1.0
50%,22996860.0,7.0,5350.0,2.0,3.0
75%,38857220.0,14.0,15471.75,4.0,5.0
max,76343880.0,25632.0,12760110.0,134.0,51.0


In [30]:
# Fonction pour enlever la balise <p>
def remove_html_tags(text):
    clean = re.compile('<.*?>|\n')
    return re.sub(clean, '', text)

In [11]:
def preprocess_text(text):
    # Supprimer les balises HTML
    text = re.sub('<.*?>', '', text)
    
    # Convertir en minuscules
    text = text.lower()
    
    # Supprimer la ponctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Supprimer les mots vides (stop words)
    stop_words = set(stopwords.words('english'))
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatisation des mots
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Rejoindre les tokens prétraités en une seule chaîne de texte
    processed_text = ' '.join(tokens)
    
    return processed_text

In [12]:
df['body_preprocess'] = df['Body'].apply(preprocess_text)

In [71]:
# Ajuster le nombre maximum de caractères affichés
pd.set_option('display.max_colwidth', None)

In [13]:
df['Body'].head()

0    <p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you have to click on OK to submit the form, you can't just hit enter.  I need it to be able to have enter work also.  It seems like what I have should work, but it doesn't</p>\n\n<p>I'm using jquery-1.3.2.js. I also have a php file with the following piece of code in it: `</p>\n\n<pre><code>  &lt;tr valign="top" align="right" style="height:40px"&gt;&lt;td &gt;\n\n    &lt;div id="signin"&gt;\n\n      &lt;table style="margin-top:4px;margin-right:4px;border-style:solid;border-width:1px"&gt;\n\n        &lt;tr&gt;&lt;td style="width:165px;"&gt;  \n\n            &lt;div&gt;&lt;center&gt;\n\n            &lt;a title="Sign In" onclick="LoginDialogOpen()" href="javascript:void();"&gt;Sign In&lt;/a&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; | &amp;nbsp;&amp;nbsp;&lt;/b&gt;\n\n            &lt;a title="Create Account" href="CreateAccount.html"&gt;Create Account&lt;/a&gt;\n\n            &lt;/center&gt;&lt;/div&gt;

In [51]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,ActivityTime
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you hav...",[<javascript><jquery><ajax><jquery-ui><enter>],1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,444
1,How do I do large non-blocking updates in PostgreSQL?,"I want to do a large update on a table in PostgreSQL, but I don't need the transactional integri...",[<postgresql><transactions><sql-update><plpgsql><dblink>],1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5045
2,Manually setting session ID in Express,I have an Angular/Express app and am trying to implement some kind of restful auth. The Express ...,[<javascript><angularjs><node.js><express><passport.js>],20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,415
3,Include Google Maps API Key in open source project?,Is it okay to put your Google Maps API Key into your source code and publish it?Others could tak...,[<security><api><open-source><google-maps><publish>],1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,0
4,Emacs ido-style shell,Is there a command line shell or shell customization that supports emacs-style ido find file? I...,[<bash><shell><emacs><eshell><ido>],1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,3839
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,My IQueryable line is: // find all timesheets for this period - from db so System.Data.Linq.Data...,[<c#><asp.net-mvc><linq-to-sql><list><iqueryable>],2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,0
49996,How can I gzip my JavaScript and CSS files?,"I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, wher...",[<javascript><css><apache><http><gzip>],2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,1994
49997,how to know location of return address on stack c/c++,i have been reading about a function that can overwrite its return address.void foo(const char* ...,[<c++><c><winapi><x86><stack>],2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,0
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,I wonder if it is possible to use firebase cloud messaging with iOS app without Apple Developer ...,[<ios><swift><firebase><apple-push-notifications><firebase-cloud-messaging>],40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,68


In [16]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

def tokenizer_fct(sentence) :
    # print(sentence)
    sentence_clean = sentence.replace('-', ' ').replace('+', ' ').replace('/', ' ').replace('#', ' ')
    word_tokens = word_tokenize(sentence_clean)
    return word_tokens

# Stop words
from nltk.corpus import stopwords
stop_w = list(set(stopwords.words('english'))) + ['[', ']', ',', '.', ':', '?', '(', ')']

def stop_word_filter_fct(list_words) :
    filtered_w = [w for w in list_words if not w in stop_w]
    filtered_w2 = [w for w in filtered_w if len(w) > 2]
    return filtered_w2

# lower case et alpha
def lower_start_fct(list_words) :
    lw = [w.lower() for w in list_words if (not w.startswith("@")) 
    # and (not w.startswith("#"))
    and (not w.startswith("http"))]
    return lw

# Lemmatizer (base d'un mot)
from nltk.stem import WordNetLemmatizer

def lemma_fct(list_words) :
    lemmatizer = WordNetLemmatizer()
    lem_w = [lemmatizer.lemmatize(w) for w in list_words]
    return lem_w

# Fonction de préparation du texte pour le bag of words (Countvectorizer et Tf_idf, Word2Vec)
def transform_bow_fct(desc_text) :
    word_tokens = tokenizer_fct(desc_text)
    sw = stop_word_filter_fct(word_tokens)
    lw = lower_start_fct(sw)
    # lem_w = lemma_fct(lw)    
    transf_desc_text = ' '.join(lw)
    return transf_desc_text

# Fonction de préparation du texte pour le bag of words avec lemmatization
def transform_bow_lem_fct(desc_text) :
    word_tokens = tokenizer_fct(desc_text)
    sw = stop_word_filter_fct(word_tokens)
    lw = lower_start_fct(sw)
    lem_w = lemma_fct(lw)    
    transf_desc_text = ' '.join(lem_w)
    return transf_desc_text

# Fonction de préparation du texte pour le Deep learning (USE et BERT)
def transform_dl_fct(desc_text) :
    word_tokens = tokenizer_fct(desc_text)
    # sw = stop_word_filter_fct(word_tokens)
    lw = lower_start_fct(word_tokens)
    # lem_w = lemma_fct(lw)    
    transf_desc_text = ' '.join(lw)
    return transf_desc_text

df['sentence_bow_lem'] = df['Body'].apply(lambda x : transform_bow_lem_fct(x))
df.shape

(50000, 11)

In [None]:
df['sentence_dl'] = df['Body'].apply(lambda x : transform_dl_fct(x))

In [8]:
# Ajuster le nombre maximum de caractères affichés
pd.set_option('display.max_colwidth', None)

In [17]:
df['sentence_bow_lem'].head()

0    website working click sign jquery dialoge modal pop click submit form n't hit enter need able enter work also seems like work n't using jquery 1.3.2.js also php file following piece code pre code valign= top align= right style= height:40px div id= signin table style= margin top:4px margin right:4px border style solid border width:1px style= width:165px div center title= sign onclick= logindialogopen href= javascript void sign amp nbsp amp nbsp amp nbsp amp nbsp title= create account href= createaccount.html create account center div table div code pre pre code div id= signin_dialog div id= label span email span label input type= text name= email id= email class= dialog input text label span password span label input type= password name= password id= password class= dialog input text center label id= login_error style= color red span amp nbsp span label center div div script login_dialog .dialog autoopen false width 310 overlay opacity 0.5 background black modal true button functio

# <a name="C2">II. Feature Engineering</a>

In [64]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,nb_of_tags
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you hav...",<javascript><jquery><ajax><jquery-ui><enter>,1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,5
1,How do I do large non-blocking updates in PostgreSQL?,"I want to do a large update on a table in PostgreSQL, but I don't need the transactional integri...",<postgresql><transactions><sql-update><plpgsql><dblink>,1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5
2,Manually setting session ID in Express,I have an Angular/Express app and am trying to implement some kind of restful auth. The Express ...,<javascript><angularjs><node.js><express><passport.js>,20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,5
3,Include Google Maps API Key in open source project?,Is it okay to put your Google Maps API Key into your source code and publish it?Others could tak...,<security><api><open-source><google-maps><publish>,1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,5
4,Emacs ido-style shell,Is there a command line shell or shell customization that supports emacs-style ido find file? I...,<bash><shell><emacs><eshell><ido>,1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,5
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,My IQueryable line is: // find all timesheets for this period - from db so System.Data.Linq.Data...,<c#><asp.net-mvc><linq-to-sql><list><iqueryable>,2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,5
49996,How can I gzip my JavaScript and CSS files?,"I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, wher...",<javascript><css><apache><http><gzip>,2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,5
49997,how to know location of return address on stack c/c++,i have been reading about a function that can overwrite its return address.void foo(const char* ...,<c++><c><winapi><x86><stack>,2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,5
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,I wonder if it is possible to use firebase cloud messaging with iOS app without Apple Developer ...,<ios><swift><firebase><apple-push-notifications><firebase-cloud-messaging>,40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,5


In [104]:
# Convertir les variables 'CreationDate' et 'LastActivityDate' en format de date
df['CreationDate'] = pd.to_datetime(df['CreationDate'])
df['LastActivityDate'] = pd.to_datetime(df['LastActivityDate'])

# Calculer la différence en jours entre les deux variables
df['ActivityTime'] = (df['LastActivityDate'] - df['CreationDate']).dt.days

In [49]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,ActivityTime
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you hav...",<javascript><jquery><ajax><jquery-ui><enter>,1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,444
1,How do I do large non-blocking updates in PostgreSQL?,"I want to do a large update on a table in PostgreSQL, but I don't need the transactional integri...",<postgresql><transactions><sql-update><plpgsql><dblink>,1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5045
2,Manually setting session ID in Express,I have an Angular/Express app and am trying to implement some kind of restful auth. The Express ...,<javascript><angularjs><node.js><express><passport.js>,20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,415
3,Include Google Maps API Key in open source project?,Is it okay to put your Google Maps API Key into your source code and publish it?Others could tak...,<security><api><open-source><google-maps><publish>,1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,0
4,Emacs ido-style shell,Is there a command line shell or shell customization that supports emacs-style ido find file? I...,<bash><shell><emacs><eshell><ido>,1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,3839
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,My IQueryable line is: // find all timesheets for this period - from db so System.Data.Linq.Data...,<c#><asp.net-mvc><linq-to-sql><list><iqueryable>,2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,0
49996,How can I gzip my JavaScript and CSS files?,"I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, wher...",<javascript><css><apache><http><gzip>,2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,1994
49997,how to know location of return address on stack c/c++,i have been reading about a function that can overwrite its return address.void foo(const char* ...,<c++><c><winapi><x86><stack>,2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,0
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,I wonder if it is possible to use firebase cloud messaging with iOS app without Apple Developer ...,<ios><swift><firebase><apple-push-notifications><firebase-cloud-messaging>,40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,68


# <a name="C2">III. Feature Extraction</a>

In [43]:
# Fonction pour extraire les tags d'une chaîne de caractères
def extract_tags(tag_string):
    tags = re.findall(r'<(.*?)>', tag_string)
    return tags

# Application de la fonction d'extraction des tags à la colonne 'tags' du DataFrame
df['Tags'] = df['Tags'].apply(extract_tags)

# Extraction des tags uniques à partir de toutes les lignes
unique_tags = set([tag for tags_list in df['Tags'] for tag in tags_list])

# Affichage des tags uniques
print("Tags uniques :")
for tag in unique_tags:
    print(tag)

Tags uniques :
openpgp
keypad
stream-wrapper
export-to-csv
jquery-mobile-button
actioncontext
doxygen
inflector
primefaces
sskeychain
estimation
android-apt
git-husky
silverlight-5.0
toolbox
oraclecommand
django-pyodbc
libpq
postbackurl
url-rewriting
variable-variables
runtime-error
babylonjs
dojox.grid
azureportal
constexpr-function
python-docx
groovyclassloader
user-defined-types
apex-code
console
android-fusedlocation
directx
hibernate-mapping
adt
appcelerator
paket
groovyshell
coderush
web-deployment
jarjar
integer-arithmetic
bgr
emr
type-alias
netbeans
javascriptmvc
form-parameter
robocopy
working-directory
ember-controllers
range
netlify
system-stored-procedures
cider
android-multiple-users
sonarqube5.1
mouseevent
leaky-abstraction
getcomputedstyle
scalac
spring-social-facebook
selectionchanged
linux
uiswitch
jsfunit
delphi-5
qvector
servicestack
internet-explorer-6
pascal
anychart
jqwidget
integer-programming
spring-security-acl
final
copying
suppression
vcpkg
fullcalendar
openc

longtable
baseadapter
webpack-plugin
webpack-4
constraint-programming
highlighting
uidatepicker
phpdoc
viewer
sgi
camlp4
category-abstractions
wxmaxima
galera
index-error
zeromq
patchwork
join-hints
metaclass
phantom-read
perspectivecamera
squeak
webvr
packrat
jvm
invokerequired
karma-mocha
mdm
apache-velocity
crosstab
numpad
google-developers-console
selenium-grid
bison
kibana
k2f
safari-extension
tttattributedlabel
form-designer
undefined
nsubstitute
bjam
state-management
xsl-fo
knockout.js
c-str
embperl
appcode
matcher
docfx
directorysearcher
parsley.js
locationlistener
sunspot-solr
sammy.js
type-families
ef-core-3.1
redux-observable
luaj
heap-dump
el
ipad-3
twisted
relative-url
stackdriver
rails-migrations
loaded
synth
jaas
exponential
azure-storage
stack-pointer
keyset
checkbox
npm-install
nrwl-nx
racket
placeholder
calayer
android-maven-plugin
nlme
asp.net-identity
avx512
argon2-ffi
react-dragula
.net-standard-1.5
dependency-inversion
virtualizingstackpanel
react-native
jquery-ui

asp.net-membership
ios7.1
redis-cluster
bin
fasta
notification-channel
octopus-deploy
psutil
skreferencenode
spy
heartbleed-bug
uos
specifier
yarn-workspaces
docker-buildkit
red5pro
cancellation-token
react-navigation-bottom-tab
ef-fluent-api
kubernetes-deployment
popup
rxdart
async.js
compiler-directives
command
lis
javafx-8
credit-card
c++14
valuetask
android-custom-view
sigint
disk-io
pragma
android-maps-v2
borrowing
quill
s-expression
countdowntimer
taylor-series
spring-cloud-function
udf
i18n-gem
floating
android-parser
requestdispatcher
infix-notation
webpack-2
csvhelper
buzzword-compliance
google-hangouts
nuxt.js
remote-debugging
visual-studio-2005
integer-partition
pageviews
amazon-ec2
scipy
adlds
microbenchmark
dbcommand
uniscribe
istio-gateway
ios7
django-swagger
viewstack
spring-dsl
x509certificate
destructor
ledit
cornice
wallet
loader
go-gorm
wordpress-media
partial-functions
gedit
xcode8-beta3
webusercontrol
itanium-abi
aether
cube-dimension
contacts
yslow
diawi
wso2-esb


graylog
ftp-server
counter-cache
vue-apollo
flot
string.h
quadtree
smartclient
windows-defender
parallels
entity-framework-4.3
django-errors
facet
method-group
nodemailer
devart
models
vb.net-2010
sqrt
css-purge
webforms
jupyter
pipeline
alter
ng-modules
.net-framework-version
yesod
google-maps-android-api-2
clientid
sticky
log4cxx
anydac
sparkapi
overlays
tor
webpack
cookiecontainer
sympy
fast-esp
databinder
compound-literals
ping
service-control-manager
qgroundcontrol
sonatype
imdb
rubocop
angularjs-ng-disabled
test-framework
one-time-password
resharper-8.0
aria-live
ora-00942
sonarqube
importerror
optimization
macos
generate-series
strtotime
persistent-volumes
robot
rescue
cufon
react-functional-component
springfox-boot-starter
setlocale
bloom-filter
xapian
pythonnet
windows-runtime
dlquery
streaminsight
component-scan
pointer-to-pointer
lexical-scope
laravel-spark
boost-interprocess
max-heap
structured-exception
dao
graphic
keyboard-layout
for-range
angular2-changedetection
stipple

cx-oracle
localhost
tool-uml
avaudiopcmbuffer
shared-variable
executionengineexception
textselection
vibration
http
networkextension
rule-engine
z80
linq-to-entities
usb
flycheck
abi
quoted-printable
cell
pykafka
avaudioplayer
opcode
stan
elasticsearch-model
meta-inf
sitemesh
sharppcap
ecmascript-next
django-channels
system-error
culerity
nexus3
methodology
argc
great-circle
rmi
stm
fftw
named-pipes
r-plotly
nsurlsessionuploadtask
mixed-content
library-design
geckodriver
slickgrid
easing-functions
waitforsingleobject
jetbrains-ide
objectid
specialized-annotation
datagridcomboboxcolumn
viewchild
craco
kotlin-coroutines
partial
formfield
angular-flex-layout
mysql-error-1093
irony
mongobee
onfling
systemd
uint32
database-first
ria
ms-media-foundation
object-files
shelve
fpdi
lua
alamofire
cntk
basm
riak
iana
calculator
popupmenu
correlation
git-rev-list
countries
criteria-api
reinforcement-learning
msxml
difftool
ctf
n-tier-architecture
python-s3fs
k-fold
driver
delphi-2009
android-paging

In [44]:
list_tags = [tag for tags_list in df['Tags'] for tag in tags_list]

In [45]:
len(list_tags)

250000

In [46]:
from collections import Counter

In [47]:
counter = Counter(list_tags)

In [48]:
counter.most_common(10)

[('c#', 6487),
 ('java', 5969),
 ('javascript', 5285),
 ('python', 4648),
 ('c++', 4587),
 ('android', 3358),
 ('ios', 3338),
 ('.net', 2956),
 ('html', 2397),
 ('php', 2196)]

In [49]:
list_tag_common = ['c#', 'java', 'javascript', 'python', 'c++', 'android', 'ios', '.net', 'html', 'php']

In [50]:
pd.set_option('display.max_colwidth', 100)

df.head()

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,sentence_bow,sentence_bow_lem,sentence_dl
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ...","[javascript, jquery, ajax, jquery-ui, enter]",1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,website working click sign jquery dialoge modal pops click submit form n't hit enter need able e...,website working click sign jquery dialoge modal pop click submit form n't hit enter need able en...,"< p > on a website i 'm working on , when you click sign on , a jquery dialoge modal pops up , b..."
1,How do I do large non-blocking updates in PostgreSQL?,"<p>I want to do a large update on a table in PostgreSQL, but I don't need the transactional inte...","[postgresql, transactions, sql-update, plpgsql, dblink]",1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,want large update table postgresql n't need transactional integrity maintained across entire ope...,want large update table postgresql n't need transactional integrity maintained across entire ope...,"< p > i want to do a large update on a table in postgresql , but i do n't need the transactional..."
2,Manually setting session ID in Express,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...,"[javascript, angularjs, node.js, express, passport.js]",20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,angular express app trying implement kind restful auth the express app passport standard usernam...,angular express app trying implement kind restful auth the express app passport standard usernam...,< p > i have an angular express app and am trying to implement some kind of restful auth . the e...
3,Include Google Maps API Key in open source project?,<p>Is it okay to put your Google Maps API Key into your source code and publish it?</p>\n\n<p>Ot...,"[security, api, open-source, google-maps, publish]",1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,okay put google maps api key source code publish others could take misuse n't want every develop...,okay put google map api key source code publish others could take misuse n't want every develope...,< p > is it okay to put your google maps api key into your source code and publish it ? < p > < ...
4,Emacs ido-style shell,<p>Is there a command line shell or shell customization that supports emacs-style ido find file?...,"[bash, shell, emacs, eshell, ido]",1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,command line shell shell customization supports emacs style ido find file emacs navigate directo...,command line shell shell customization support emacs style ido find file emacs navigate director...,< p > is there a command line shell or shell customization that supports emacs style ido find fi...


In [34]:
list_tags_test = ['python', 'green', 'bonjour']

In [42]:
list_tags_test2 = ['green', 'bonjour']

In [53]:
def extract_tags(list_to_check):
    for tags in list_to_check:
        if tags in list_tag_common:
            return tags
    return ''

In [54]:
extract_tags()

TypeError: extract_tags() missing 1 required positional argument: 'list_to_check'

In [55]:
df['main_tag'] = df['Tags'].apply(extract_tags)

In [56]:
df['main_tag'].head(49)

0     javascript
1               
2     javascript
3               
4               
5               
6               
7            c++
8     javascript
9               
10            c#
11        python
12    javascript
13    javascript
14            c#
15            c#
16    javascript
17              
18    javascript
19              
20              
21            c#
22            c#
23              
24        python
25              
26            c#
27              
28            c#
29           c++
30    javascript
31            c#
32          .net
33              
34              
35       android
36          java
37              
38            c#
39              
40    javascript
41           c++
42              
43           php
44            c#
45              
46           ios
47           php
48              
Name: main_tag, dtype: object

In [57]:
df_main_tag = df[df['main_tag'] != '']

In [58]:
df_main_tag

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,sentence_bow,sentence_bow_lem,sentence_dl,main_tag
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ...","[javascript, jquery, ajax, jquery-ui, enter]",1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,website working click sign jquery dialoge modal pops click submit form n't hit enter need able e...,website working click sign jquery dialoge modal pop click submit form n't hit enter need able en...,"< p > on a website i 'm working on , when you click sign on , a jquery dialoge modal pops up , b...",javascript
2,Manually setting session ID in Express,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...,"[javascript, angularjs, node.js, express, passport.js]",20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,angular express app trying implement kind restful auth the express app passport standard usernam...,angular express app trying implement kind restful auth the express app passport standard usernam...,< p > i have an angular express app and am trying to implement some kind of restful auth . the e...,javascript
7,"C++ RTTI in a Windows 64-bit VectoredExceptionHandler, MS Visual Studio 2015",<p>I'm working on small Windows Exception handling engine trying to gather maximum information f...,"[c++, visual-studio, exception, x86-64, rtti]",39113168,8,2230,1,2016-08-24 01:44:29,2016-08-26 02:19:16,2,working small windows exception handling engine trying gather maximum information system includi...,working small window exception handling engine trying gather maximum information system includin...,< p > i 'm working on small windows exception handling engine trying to gather maximum informati...,c++
8,How to prevent the keyboard from popping up on mobile devices?,"<p><a href=""http://api.jqueryui.com/spinner/"" rel=""noreferrer"">http://api.jqueryui.com/spinner/<...","[javascript, jquery, html, css, mobile]",39113558,5,23143,4,2016-08-24 02:39:08,2018-10-05 12:15:51,4,href= api.jqueryui.com spinner rel= noreferrer api.jqueryui.com spinner trying use jquery spinne...,href= api.jqueryui.com spinner rel= noreferrer api.jqueryui.com spinner trying use jquery spinne...,< p > < a href= '' : api.jqueryui.com spinner `` rel= '' noreferrer '' > : api.jqueryui.com spin...,javascript
10,How do you do dependency injection with AutoFac and OWIN?,<p>This is for MVC5 and the new pipeline. I cannot find a good example anywhere.</p>\n\n<pre><c...,"[c#, dependency-injection, asp.net-mvc-5, owin, autofac]",20061082,15,18422,1,2013-11-19 00:52:43,2016-05-17 13:10:51,3,this mvc5 new pipeline find good example anywhere. pre code public static void configureioc iapp...,this mvc5 new pipeline find good example anywhere. pre code public static void configureioc iapp...,< p > this is for mvc5 and the new pipeline . i can not find a good example anywhere. < p > < pr...,c#
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49993,Lazy-loading visible items in a Listview,"<p>I have a listview which uses the following code:</p>\n\n<pre><code>&lt;ListView x:Name=""Displ...","[c#, wpf, xaml, data-binding, lazy-loading]",21319143,5,10079,1,2014-01-23 20:57:18,2021-04-27 06:24:54,9,listview uses following code pre code listview name= display itemssource= binding background= 37...,listview us following code pre code listview name= display itemssource= binding background= 3737...,< p > i have a listview which uses the following code : < p > < pre > < code > & lt ; listview x...,c#
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,<p>My IQueryable line is:</p>\n\n<pre><code> // find all timesheets for this period - from db so...,"[c#, asp.net-mvc, linq-to-sql, list, iqueryable]",2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,iqueryable line pre code find timesheets period system.data.linq.dataquery var timesheets _times...,iqueryable line pre code find timesheets period system.data.linq.dataquery var timesheets _times...,< p > my iqueryable line is : < p > < pre > < code > find all timesheets for this period from db...,c#
49996,How can I gzip my JavaScript and CSS files?,"<p>I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, w...","[javascript, css, apache, http, gzip]",2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,problem gzip prototype lib totaly idea start works find tutorials n't helpful ... folder files c...,problem gzip prototype lib totaly idea start work find tutorial n't helpful ... folder file comp...,"< p > i have a problem , i have to gzip a prototype lib , but i totaly have no idea how to do th...",javascript
49997,how to know location of return address on stack c/c++,<p>i have been reading about a function that can overwrite its return address.</p>\n\n<pre><code...,"[c++, c, winapi, x86, stack]",2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,reading function overwrite return address. pre code void foo const char input char buf what extr...,reading function overwrite return address. pre code void foo const char input char buf what extr...,< p > i have been reading about a function that can overwrite its return address. < p > < pre > ...,c++


In [54]:
df_main_tag

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,Body_nobalise,main_tag
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form","<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ...","[javascript, jquery, ajax, jquery-ui, enter]",1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,"<p>On a website I'm working on, when you click sign on, a jquery dialoge modal pops up, but you ...",javascript
2,Manually setting session ID in Express,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...,"[javascript, angularjs, node.js, express, passport.js]",20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,<p>I have an Angular/Express app and am trying to implement some kind of restful auth. The Expre...,javascript
7,"C++ RTTI in a Windows 64-bit VectoredExceptionHandler, MS Visual Studio 2015",<p>I'm working on small Windows Exception handling engine trying to gather maximum information f...,"[c++, visual-studio, exception, x86-64, rtti]",39113168,8,2230,1,2016-08-24 01:44:29,2016-08-26 02:19:16,2,<p>I'm working on small Windows Exception handling engine trying to gather maximum information f...,c++
8,How to prevent the keyboard from popping up on mobile devices?,"<p><a href=""http://api.jqueryui.com/spinner/"" rel=""noreferrer"">http://api.jqueryui.com/spinner/<...","[javascript, jquery, html, css, mobile]",39113558,5,23143,4,2016-08-24 02:39:08,2018-10-05 12:15:51,4,"<p><a href=""http://api.jqueryui.com/spinner/"" rel=""noreferrer"">http://api.jqueryui.com/spinner/<...",javascript
10,How do you do dependency injection with AutoFac and OWIN?,<p>This is for MVC5 and the new pipeline. I cannot find a good example anywhere.</p>\n\n<pre><c...,"[c#, dependency-injection, asp.net-mvc-5, owin, autofac]",20061082,15,18422,1,2013-11-19 00:52:43,2016-05-17 13:10:51,3,<p>This is for MVC5 and the new pipeline. I cannot find a good example anywhere.</p>\n\n<pre><c...,c#
...,...,...,...,...,...,...,...,...,...,...,...,...
49993,Lazy-loading visible items in a Listview,"<p>I have a listview which uses the following code:</p>\n\n<pre><code>&lt;ListView x:Name=""Displ...","[c#, wpf, xaml, data-binding, lazy-loading]",21319143,5,10079,1,2014-01-23 20:57:18,2021-04-27 06:24:54,9,"<p>I have a listview which uses the following code:</p>\n\n<pre><code>&lt;ListView x:Name=""Displ...",c#
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collect...,<p>My IQueryable line is:</p>\n\n<pre><code> // find all timesheets for this period - from db so...,"[c#, asp.net-mvc, linq-to-sql, list, iqueryable]",2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,<p>My IQueryable line is:</p>\n\n<pre><code> // find all timesheets for this period - from db so...,c#
49996,How can I gzip my JavaScript and CSS files?,"<p>I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, w...","[javascript, css, apache, http, gzip]",2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,"<p>I have a problem, I have to gzip a prototype Lib, but i totaly have no idea how to do this, w...",javascript
49997,how to know location of return address on stack c/c++,<p>i have been reading about a function that can overwrite its return address.</p>\n\n<pre><code...,"[c++, c, winapi, x86, stack]",2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,<p>i have been reading about a function that can overwrite its return address.</p>\n\n<pre><code...,c++


In [109]:
df

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,AnswerCount,CreationDate,LastActivityDate,CommentCount,ActivityTime
0,"JQuery - AJAX dialog modal, can't hit enter key to submit form",website im working click sign jquery dialoge modal pop click ok submit form cant hit enter need able enter work also seems like work doesnt im using jquery132js also php file following piece code lttr valigntop alignright styleheight40pxgtlttd gt ltdiv idsigningt lttable stylemargintop4pxmarginright4pxborderstylesolidborderwidth1pxgt lttrgtlttd stylewidth165pxgt ltdivgtltcentergt lta titlesign onclicklogindialogopen hrefjavascriptvoidgtsign inltagtltbgtampnbspampnbsp ampnbspampnbspltbgt lta titlecreate account hrefcreateaccounthtmlgtcreate accountltagt ltcentergtltdivgt lttdgtlttrgt lttablegt ltdivgt lttdgtlttrgt ltdiv idsignin_dialog gt ltdiv idbggt ltlabelgtltspangtemailltspangtltlabelgt ltinput typetext nameemail idemail classdialoginputtextgt ltbrgt ltlabelgtltspangtpasswordltspangtltlabelgt ltinput typepassword namepassword idpassword classdialoginputtextgt ltbrgt ltbrgt ltcentergtltbgtltlabel idlogin_error stylecolorredgtltspangtampnbspltspangtltlabelgtltcentergtltbgt ltdivgt ltdivgt ltscriptgt login_dialogdialog autoopen false width 310 overlay opacity 05 background black modal true button ok function bodyaddclasscurwait sql select client_id user email email0value login_password password0value getbongodataphp tasksqlresulttojson sql sql resultofloginattempt json cancel function thisdialogclose ltscriptgt javascript file following function function logindialogopen login_dialogdialogopen login_dialogkeypressfunctione ewhich 13 bodyaddclasscurwait sql select client_id user email email0value login_password password0value getbongodataphp tasksqlresulttojson sql sql resultofloginattempt json code dont understand isnt working also try login_dialogdialogisopen right opened always returned false oddly enough please help,"[javascript, jquery, ajax, jquery-ui, enter]",1113203,4,7735,3,2009-07-11 08:03:25,2010-09-28 09:27:26,3,444
1,How do I do large non-blocking updates in PostgreSQL?,want large update table postgresql dont need transactional integrity maintained across entire operation know column im changing going written read update want know easy way psql console make type operation faster example let say table called order 35 million row want update order set status null avoid diverted offtopic discussion let assume value status 35 million column currently set nonnull value thus rendering index useless problem statement take long time go effect solely locking changed row locked entire update complete update might take 5 hour whereas something like update order set status null order_id gt 0 order_id lt 1000000 might take 1 minute 35 million row breaking chunk 35 would take 35 minute save 4 hour 25 minute could break even script using pseudocode 0 3500 db_operation update order set status null order_id gt i1000 order_id lt i11000 operation might complete minute rather 35 come im really asking dont want write freaking script break operation every single time want big onetime update like way accomplish want entirely within sql,"[postgresql, transactions, sql-update, plpgsql, dblink]",1113277,83,53458,9,2009-07-11 08:46:42,2023-05-05 05:12:05,4,5045
2,Manually setting session ID in Express,angularexpress app trying implement kind restful auth express app passport standard usernamepass login redis session successful login return session id angular sends every request header problem dont know make express make use session id tried writing reqsessionid middleware success use header query string way send session id along,"[javascript, angularjs, node.js, express, passport.js]",20060800,4,3606,1,2013-11-19 00:27:34,2015-01-08 16:58:59,2,415
3,Include Google Maps API Key in open source project?,okay put google map api key source code publish others could take misuse dont want every developer user get api key type somewhere owner key responsible create new google account project project desktop application objectivec small developer tool would best way make convenient,"[security, api, open-source, google-maps, publish]",1113292,10,1676,1,2009-07-11 09:01:50,2009-07-11 13:19:49,1,0
4,Emacs ido-style shell,command line shell shell customization support emacsstyle ido find file emacs navigate directory extremely quickly using cx cf idomode ideally im looking solution used outside emacs though id open way quickly change directory within eshell buffer,"[bash, shell, emacs, eshell, ido]",1112805,20,2683,7,2009-07-11 02:51:56,2020-01-14 13:00:16,5,3839
...,...,...,...,...,...,...,...,...,...,...,...
49995,Linq filtering an IQueryable<T> (System.Data.Linq.DataQuery) object by a List<T> (System.Collection.Generic.List) object?,iqueryable line find timesheets period db systemdatalinqdataquery var timesheets _timesheetrepositoryfindbyperioddte1 dte2 list line get team ad active directory systemcollectiongenericlist var adusers _aduserrepositorygetmyteamuseridentityname wish show timesheets user timesheet collection present user collection use standard c expression var teamsheets timesheets join user adusers tuser1username equal userfullname select get error iqueryable return selfreferencing constant expression supported recommendation,"[c#, asp.net-mvc, linq-to-sql, list, iqueryable]",2666065,5,6390,2,2010-04-19 08:21:35,2010-04-19 08:47:48,1,0
49996,How can I gzip my JavaScript and CSS files?,problem gzip prototype lib totaly idea start work find tutorial wasnt helpful folder j file compressedjs 1js 2js 3js im calling file test file compressesindexphp ltlink reljavascript typetextjs hrefjstabsjs gt ltlink reljavascript typetextjs hrefjsfbjs gt,"[javascript, css, apache, http, gzip]",2666120,23,50791,6,2010-04-19 08:33:01,2015-10-05 06:22:59,1,1994
49997,how to know location of return address on stack c/c++,reading function overwrite return address void fooconst char input char buf10 extra argument supplied printf cheap trick view stack 8 well see trick look format string printfmy stack look likenpnpnpnpnpn pnn p ie expect pointer pas user input straight secure code public enemy 1 strcpybuf input printfsn buf printfnow stack look likenpnpnpnpnpnpnn sugggested stack would look like address foo 00401000 stack look like 00000000 00000000 7ffdf000 0012ff80 0040108a lt want overwrite return address foo 00410ede question author arbitrarily choose second last value return address foo value added stack bottom top apart function return address value apparently see stack ie isnt filled zero thanks,"[c++, c, winapi, x86, stack]",2666301,5,7400,1,2010-04-19 09:05:09,2010-04-19 10:01:46,2,0
49998,Is it possible to use Firebase Cloud Messaging in iOS app without Apple Developer Program?,wonder possible use firebase cloud messaging io app without apple developer program instance asking whether set certificate apple push notification havent found much information web,"[ios, swift, firebase, apple-push-notifications, firebase-cloud-messaging]",40194149,5,1744,1,2016-10-22 16:00:06,2016-12-30 09:19:53,2,68


In [105]:
df.shape

(50000, 11)

In [106]:
# Initialiser le vectoriseur de mots
vectorizer = CountVectorizer()

# Appliquer le vectoriseur aux questions
X = vectorizer.fit_transform(df['Body'])

# Obtenir la liste des mots (features)
features = vectorizer.get_feature_names()

# Afficher la matrice des fonctionnalités
print("Matrice des fonctionnalités :")
print(X)

# Afficher la liste des mots (features)
print("Liste des mots (features) :")
print(features)



Matrice des fonctionnalités :
  (0, 679278)	1
  (0, 315440)	2
  (0, 685884)	2
  (0, 147713)	2
  (0, 571309)	1
  (0, 342730)	1
  (0, 205513)	1
  (0, 407212)	2
  (0, 486898)	1
  (0, 446112)	2
  (0, 600211)	1
  (0, 254424)	1
  (0, 132309)	1
  (0, 286158)	1
  (0, 226708)	2
  (0, 428084)	1
  (0, 67884)	1
  (0, 685559)	2
  (0, 79689)	3
  (0, 552052)	1
  (0, 359142)	1
  (0, 212966)	1
  (0, 659322)	1
  (0, 342733)	1
  (0, 481001)	1
  :	:
  (49998, 502002)	1
  (49998, 282480)	1
  (49998, 436506)	1
  (49998, 413850)	1
  (49998, 329523)	1
  (49998, 94909)	2
  (49998, 685179)	1
  (49998, 401305)	1
  (49998, 136701)	1
  (49998, 248269)	1
  (49998, 149742)	1
  (49999, 450108)	1
  (49999, 338469)	1
  (49999, 636936)	1
  (49999, 357622)	1
  (49999, 153477)	1
  (49999, 180651)	1
  (49999, 90726)	1
  (49999, 546187)	1
  (49999, 90640)	3
  (49999, 521712)	1
  (49999, 603908)	2
  (49999, 187837)	1
  (49999, 264665)	1
  (49999, 271172)	1
Liste des mots (features) :


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [60]:
df_main_tag['sentence_bow_lem'] = df_main_tag['Body'].apply(lambda x : transform_bow_lem_fct(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_main_tag['sentence_bow_lem'] = df_main_tag['Body'].apply(lambda x : transform_bow_lem_fct(x))


In [67]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Création du sac de mots
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df_main_tag['sentence_bow_lem'])

# Division des données en ensembles d'entraînement et de test
X_train, X_test, y_train, y_test = train_test_split(X, df_main_tag['main_tag'], test_size=0.2, random_state=42)

# Définition de la grille des hyperparamètres
param_grid = {'alpha': [0.01, 0.1, 1.0]}

# Entraînement du classificateur bayésien naïf
clf = MultinomialNB()
# Recherche par grille pour l'optimisation des hyperparamètres
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Meilleurs hyperparamètres et meilleur score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Meilleurs hyperparamètres : ", best_params)
print("Meilleur score : {:.2f}%".format(best_score * 100))

# Prédiction sur l'ensemble de test avec les meilleurs hyperparamètres
best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)

# Calcul de l'exactitude (accuracy) du modèle
accuracy = accuracy_score(y_test, y_pred)
print("Exactitude du modèle : {:.2f}%".format(accuracy * 100))

Meilleurs hyperparamètres :  {'alpha': 0.1}
Meilleur score : 76.46%
Exactitude du modèle : 76.01%


In [69]:
from sklearn.metrics.pairwise import cosine_similarity
# Mesure de similarité entre les questions à l'aide du sac de mots
similarity_matrix = cosine_similarity(X)

# Exemple de similarité entre deux questions (indices 0 et 1)
question_1 = X[0]
question_2 = X[1]
similarity = similarity_matrix[0, 1]
print("Similarité entre les questions 0 et 1 : {:.2f}".format(similarity))

Similarité entre les questions 0 et 1 : 0.00


In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Prédiction sur l'ensemble de test avec les meilleurs hyperparamètres
best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)

# Calcul de l'exactitude (accuracy) du modèle
accuracy = accuracy_score(y_test, y_pred)
print("Exactitude du modèle : {:.2f}%".format(accuracy * 100))

# Matrice de confusion
cm = confusion_matrix(y_test, y_pred)
categories = vectorizer.get_feature_names()
n_categories = len(categories)
plt.figure(figsize=(min(n_categories, 20), min(n_categories, 20)))
sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu', xticklabels=categories, yticklabels=categories)
plt.xlabel('Prédits')
plt.ylabel('Réels')
plt.title('Matrice de confusion')
plt.show()

# Rapport de classification
classification_rep = classification_report(y_test, y_pred)
print("Rapport de classification :\n", classification_rep)

Exactitude du modèle : 76.01%




KeyboardInterrupt: 

In [None]:
# Matrice de confusion
cm = confusion_matrix(y_test, y_pred)
categories = vectorizer.get_feature_names_out()
n_categories = len(categories)
plt.figure(figsize=(min(n_categories, 20), min(n_categories, 20)))
sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu', xticklabels=categories, yticklabels=categories)
plt.xlabel('Prédits')
plt.ylabel('Réels')
plt.title('Matrice de confusion')
plt.show()

# Rapport de classification
classification_rep = classification_report(y_test, y_pred)
print("Rapport de classification :\n", classification_rep)

In [5]:
import enchant

def remove_nonexistent_words(text):
    words = text.split()
    english_dict = enchant.Dict("en_US")

    valid_words = []
    for word in words:
        if english_dict.check(word):
            valid_words.append(word)

    return ' '.join(valid_words)

text = "website im working click sign jquery dialoge modal pop click ok submit form cant hit enter need able enter work also seems like work doesnt im using jquery132js also php file following piece code lttr valigntop alignright styleheight40pxgtlttd gt ltdiv idsigningt lttable stylemargintop4pxmarginright4pxborderstylesolidborderwidth1pxgt lttrgtlttd stylewidth165pxgt ltdivgtltcentergt lta titlesign onclicklogindialogopen hrefjavascriptvoidgtsign inltagtltbgtampnbspampnbsp ampnbspampnbspltbgt lta titlecreate account hrefcreateaccounthtmlgtcreate accountltagt ltcentergtltdivgt lttdgtlttrgt lttablegt ltdivgt lttdgtlttrgt ltdiv idsignin_dialog gt ltdiv idbggt ltlabelgtltspangtemailltspangtltlabelgt ltinput typetext nameemail idemail classdialoginputtextgt ltbrgt ltlabelgtltspangtpasswordltspangtltlabelgt ltinput typepassword namepassword idpassword classdialoginputtextgt ltbrgt ltbrgt ltcentergtltbgtltlabel idlogin_error stylecolorredgtltspangtampnbspltspangtltlabelgtltcentergtltbgt ltdivgt ltdivgt ltscriptgt login_dialogdialog autoopen false width 310 overlay opacity 05 background black modal true button ok function bodyaddclasscurwait sql select client_id user email email0value login_password password0value getbongodataphp tasksqlresulttojson sql sql resultofloginattempt json cancel function thisdialogclose ltscriptgt javascript file following function function logindialogopen login_dialogdialogopen login_dialogkeypressfunctione ewhich 13 bodyaddclasscurwait sql select client_id user email email0value login_password password0value getbongodataphp tasksqlresulttojson sql sql resultofloginattempt json code dont understand isnt working also try login_dialogdialogisopen right opened always returned false oddly enough please help"

text_without_nonexistent_words = remove_nonexistent_words(text)
print(text_without_nonexistent_words)

website working click sign modal pop click submit form cant hit enter need able enter work also seems like work using also file following piece code gt account gt false width 310 overlay opacity 05 background black modal true button function select user email cancel function file following function function 13 select user email code understand working also try right opened always returned false oddly enough please help


In [None]:
df['sentence_bow_lem'] = df['sentence_bow_lem'].apply(remove_nonexistent_words)

Exception ignored in: <function Dict.__del__ at 0x000002BE318B4820>
Traceback (most recent call last):
  File "C:\Users\omira\anaconda3\lib\site-packages\enchant\__init__.py", line 556, in __del__
    self._free()
  File "C:\Users\omira\anaconda3\lib\site-packages\enchant\__init__.py", line 614, in _free
    self._broker._free_dict(self)
  File "C:\Users\omira\anaconda3\lib\site-packages\enchant\__init__.py", line 322, in _free_dict
    self._free_dict_data(dict._this)
  File "C:\Users\omira\anaconda3\lib\site-packages\enchant\__init__.py", line 329, in _free_dict_data
    _e.broker_free_dict(self._this, dict)
KeyboardInterrupt: 


In [None]:
df['sentence_bow_lem'].head(20)