**Table of contents**<a id='toc0_'></a>    
- [Mise en place](#toc1_)    
  - [Fonctions et classes](#toc1_1_)    
  - [Récupération des données](#toc1_2_)    
- [Nettoyage des données](#toc2_)    
  - [Langages de programmation de Wikipedia](#toc2_1_)    
  - [xxxxxxxxxxxxxxx](#toc2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Mise en place](#toc0_)

In [223]:
import numpy as np
import pandas as pd
import re

# for web scraping
import urllib.request, json 
from html.parser import HTMLParser

## <a id='toc1_1_'></a>[Fonctions et classes](#toc0_)

🚧 à modulariser

In [224]:
class LangParser(HTMLParser):
    """Parse names from an extracted HTML"""
    def __init__(self):
        HTMLParser.__init__(self)
        self.recording = 0
        self.data = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == "a":
            self.recording = 0

    def handle_data(self, data):
        if self.recording:
            # avoid adding edit links
            if data.strip() != "edit":
                # remove everything between parentheses
                data = re.sub(r'\([^)]*\)', '', data)
                # remove start & end white spaces + convert to lower case
                data = data.strip().lower()
                self.data.add(data)


def get_languages():
    """Return the list of programming languages from Wikipedia"""
    url = "https://en.wikipedia.org/w/api.php?action=parse&page=List_of_programming_languages&prop=text&format=json&disabletoc=1&formatversion=2"
    with urllib.request.urlopen(url) as response:
        scrap = json.load(response)["parse"]["text"]

    # get all programming languages ever declared on Wikipedia
    start = scrap.find('<h2><span class="mw-headline" id="A')
    end = scrap.find('<h2><span class="mw-headline" id="See_also')
    scrap = scrap[start:end]

    parser = LangParser()
    parser.feed(scrap)
    prog_langs = parser.data
    
    return prog_langs


def clean_string(string, excluded=None):
    """Return a cleaned version of an input string"""
    # remove code tags
    string = re.sub(r"<code>.*?<\/code>", "", string, flags=re.S)
    # remove img tags
    string = re.sub(r"<img.*?>", "", string)
    # remove all html tags
    string = re.sub(r"<.*?>", "", string)
    # remove newlines
    string = re.sub(r"\n", " ", string)
    # lowercase
    string = string.lower()
    # keep only letters, digits, spaces and some useful characters
    string = re.sub(r"[^\w *'#+_.-]", " ", string)
    # remove suspension points
    string = re.sub(r"\.\.\.", " ", string)
    # remove multiple spaces
    string = re.sub(r" +", " ", string)

    # remove digits only tokens
    string = re.sub(r"\s\d+\s", " ", string)
    # remove symbols only tokens
    string = re.sub(r"\s\W\s", " ", string)

    words = string.split()
    for w in words:
        if excluded:
            if w not in excluded:
                # remove 1-letter words
                if len(w) == 1:
                    words.remove(w)
                # remove points at the end of words
                if len(w) > 1 and w[-1] == ".":
                    words[words.index(w)] = w[:-1]
    string = " ".join(words)
    return string



## <a id='toc1_2_'></a>[Récupération des données](#toc0_)

[Requête SQL data.stackexchange.com ici](https://data.stackexchange.com/stackoverflow/query/1824062/top-50k-relevant-questions)

⇒ compromis fait entre qualité maximale et contraintes (≥ 5 tags)
- pour qualité, score ≥ 10 et > 3 réponses

🚧 >=1 réponse : voir cohérence & qualité

- dates jusqu’à 2009, cohérent
- reste à spectre large pour tout type de question

seules features indispensables & RGPD dans la requête

# 🚧 voir cohérence des tags en place

In [225]:
df_raw = pd.read_csv("data/2024-03-11 QueryResults.csv")

# <a id='toc2_'></a>[Nettoyage des données](#toc0_)

In [226]:
# remove useless features
df = df_raw.drop(columns=["Score", "AnswerCount", "CreationDate"])

In [227]:
# change Tags string to list
df["Tags"] = df["Tags"].apply(lambda x: x[1:-2].split("><"))

## Termes spécifiques

### Tags existants

In [228]:
def add_tags(x):
    for e in x: tags.add(e)

# use a set to avoid duplicates
tags = set()
df["Tags"].apply(lambda x: add_tags(x))
print(f"Found {len(tags)} existing tags")

Found 22629 existing tags


### <a id='toc2_1_'></a>[Langages de programmation de Wikipedia](#toc0_)

Sur StackOverFlow, les noms de langages informatiques dans les dialogues sont fréquents.  
Non seulement ils ont des syntaxes spécifiques (.QL, C++, S, C#...) facilement éliminées par des traitements de caractères spécifiques, mais ils sont également représentatifs du sujet et font souvent partie des tags.

🚧 [HTML Parser](https://stackoverflow.com/questions/59660495/python-html-extract-text-into-list) : (cf. [question SoF](https://stackoverflow.com/questions/59660495/python-html-extract-text-into-list))  
[RegEx JL](https://www.notion.so/julmat/RegEx-expressions-r-guli-res-70fbbcb177ee476ba9a5ae011d14fe6f) & [RegEx 101](https://regex101.com/?flavor=python&flags=gm)

In [229]:
prog_lang = get_languages()
print(f"Found {len(prog_lang)} programming languages on Wikipedia")

Found 692 programming languages on Wikipedia


### Ensemble complet des termes

In [230]:
# concatenate all specific terms
spec_terms = prog_lang.union(tags)
print(f"Total: {len(spec_terms)} specific terms")

Total: 23184 specific terms


# XXXXXXXXXXXXXXXXXXXXXXXX

In [231]:
df["Body"][3]

"<p>I have functions like the following:</p>\n<pre><code>const char* get_message() {\n    return &quot;This is a constant message, will NOT change forever!&quot;;\n};\n\nconst char* get_message2() {\n    return &quot;message2&quot;;\n};\n</code></pre>\n<p>And I'm planning to use them everywhere my app, even though in different threads.</p>\n<p>I'm wondering about the life time of these strings, i.e. whether it's safe to use these <code>const char*</code> string out of the function <code>get_message</code>.</p>\n<p>I guess that a hard coded <code>const char*</code> string will be compiled into the codes segment of a app instead of the data segments, so maybe it is safely to use them as above?</p>\n"

In [232]:
df["Body"] = df["Body"].apply(clean_string, excluded=spec_terms)
df["Title"] = df["Title"].apply(clean_string, excluded=spec_terms)

In [233]:
df

Unnamed: 0,Title,Body,Tags
0,std shared_mutex unlock_shared blocks even tho...,my team has encountered deadlock that i suspec...,"[c++, windows, multithreading, stl, shared-loc]"
1,what is the correct output of sizeof string,on microcontroller in order to avoid loading s...,"[c, language-lawyer, sizeof, string-literals, ..."
2,problem loading external scripts like jquery,i'm facing problem since this morning with web...,"[javascript, html, jquery, google-apps-script,..."
3,does const char* literal string persistently e...,i have functions like the following and i'm pl...,"[c++, constants, constexpr, lifetime, null-ter..."
4,willpopscope is deprecated in flutter,'willpopscope' is deprecated and shouldn't be ...,"[flutter, dart, mobile, cross-platform, deprec..."
...,...,...,...
49995,the activator for bundle is invalid,i'm trying to create simple plugin in eclipse ...,"[java, eclipse, eclipse-plugin, osgi, osgi-bundl]"
49996,pre-allocate space for c++ stl queue,i'm writing radix sort algorithm using queues ...,"[c++, performance, memory, stl, queu]"
49997,adjusting the positions of the labels in uitab...,i'm using uitableviewcell with uitableviewcell...,"[iphone, objective-c, cocoa-touch, iphone-sdk-..."
49998,padding on stackpanel,i am trying to set padding of stackpanel but t...,"[wpf, xaml, layout, padding, stackpane]"


In [234]:
df["Body"][3]

"i have functions like the following and i'm planning to use them everywhere my app even though in different threads. i'm wondering about the life time of these strings i.e whether it's safe to use these string out of the function i guess that hard coded string will be compiled into the codes segment of app instead of the data segments so maybe it is safely to use them as above"