**Context:**

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.

Created in 2008 by Jeff Atwood and Joel Spolsky, Stack Overflow had over 10 million registered users by January 2019 and exceeded 16 million questions in mid-2018.

The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Reddit.

The website also relies on tags attached to users' questions. Tags help a question reach the right part of the community and get answered more easily, but they are a challenge for new users, who do not always know which tags to use.

The goal of this project is to build supervised and unsupervised models that suggest which tags to use, based on the title and body text of a question.

# Libraries

In [None]:
import pandas as pd   
import warnings
warnings.filterwarnings("ignore")

from bs4 import BeautifulSoup
import re

import nltk
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords

import missingno as msno

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Loading Data

The data were collected with the Stack Exchange Data Explorer: https://data.stackexchange.com/stackoverflow/query/new.

To do: gather all questions in a single dataframe and check for duplicates.

In [None]:
#Listing all files

file_list= ["/content/drive/MyDrive/DATASETS/ML/P5/QueryResults_jan_19.csv",
            "/content/drive/MyDrive/DATASETS/ML/P5/QueryResults_avr_19.csv",
            "/content/drive/MyDrive/DATASETS/ML/P5/QueryResults_sept_19.csv",
            "/content/drive/MyDrive/DATASETS/ML/P5/QueryResults_mars_2020.csv",
            "/content/drive/MyDrive/DATASETS/ML/P5/QueryResults_dec_2020.csv"]

# loading every file and concatenating them into a single dataframe
# (ignore_index=True gives the result a fresh 0..n-1 index)
data = pd.concat([pd.read_csv(file) for file in file_list], ignore_index=True)

# checking shape
data.shape

(203149, 7)

In [None]:
# looking at columns
data.columns

Index(['id', 'PostTypeId', 'Title', 'Body', 'Tags', 'CreationDate', 'Score'], dtype='object')

In [None]:
# keeping only columns of interest
data = data[['Title', 'Body', 'Tags','CreationDate']]

# dropping duplicates
data.drop_duplicates(inplace=True)

# checking shape again
data.shape

(202861, 4)
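
`drop_duplicates` with no arguments only removes rows that match on every remaining column. A minimal sketch of a stricter check on the text columns alone is shown here. The `toy` frame and its rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (invented rows) standing in for the loaded questions
toy = pd.DataFrame({
    "Title": ["How to set size?", "How to set size?", "How to set size?", "pyenv failed"],
    "Body": ["<p>same body</p>", "<p>same body</p>", "<p>same body</p>", "<p>other body</p>"],
    "CreationDate": ["2019-01-01 00:26:29", "2019-01-01 00:26:29",
                     "2019-04-02 10:00:00", "2019-01-01 00:44:43"],
})

# Rows that repeat an earlier row on every column
n_exact = toy.duplicated().sum()

# Rows that repeat an earlier (Title, Body) pair, whatever the date
n_text = toy.duplicated(subset=["Title", "Body"]).sum()
print(n_exact, n_text)

# Keep the first occurrence of each (Title, Body) pair
deduplicated = toy.drop_duplicates(subset=["Title", "Body"]).reset_index(drop=True)
print(deduplicated.shape)
```

Whether the stricter key is appropriate depends on how the query exports overlap, so treat it as an option to check rather than a required step.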

In [None]:
# reset index and taking a look at raw data
data.reset_index(inplace=True, drop=True)
data.head(5)

| | Title | Body | Tags | CreationDate |
|---|---|---|---|---|
| 0 | How to rearrange subplots so that one is under... | `<p>I am trying to code two plots such that one...` | `<python><matplotlib><subplot>` | 2019-01-01 00:05:48 |
| 1 | perl6 How to use junction inside regex interpo... | `<p>Sometimes I have a long list and I would li...` | `<regex><interpolation><raku><junction>` | 2019-01-01 00:12:59 |
| 2 | How to set size/rotate image in jekyll? | `<p>How to set size of image in jekyll markdown...` | `<html><markdown><jekyll><github-pages>` | 2019-01-01 00:26:29 |
| 3 | Scons appending a random '1' to macro definiti... | `<p>I have a command line argument that defines...` | `<c++><macos><g++><scons>` | 2019-01-01 00:37:31 |
| 4 | pyenv failed to download a existing version of... | `<p>I recently installed pyenv and attempted to...` | `<python><macos><pyenv>` | 2019-01-01 00:44:43 |


# Data Cleaning and Text Preprocessing

In [None]:
# looking at a title sample
data["Title"][100]

'Cannot create PhoneAuthCredential without verificationProof?'

In [None]:
# looking at body sample
data["Body"][100]

'<p>I have followed the guideline of firebase docs to implement login into my app but there is a problem while signup, the app is crashing and the catlog showing the following erros :</p>\n\n<pre><code>Process: app, PID: 12830\n    java.lang.IllegalArgumentException: Cannot create PhoneAuthCredential without either verificationProof, sessionInfo, ortemprary proof.\n        at com.google.android.gms.common.internal.Preconditions.checkArgument(Unknown Source)\n        at com.google.firebase.auth.PhoneAuthCredential.&lt;init&gt;(Unknown Source)\n        at com.google.firebase.auth.PhoneAuthProvider.getCredential(Unknown Source)\n        at app.MainActivity.verifyPhoneNumberWithCode(MainActivity.java:132)\n        at app.MainActivity.onClick(MainActivity.java:110)\n        at android.view.View.performClick(View.java:4803)\n        at android.view.View$PerformClick.run(View.java:20102)\n        at android.os.Handler.handleCallback(Handler.java:810)\n        at android.os.Handler.dispatchMes

In [None]:
# getting a raw text to test cleaning and pre-processing step by step
raw_text = data["Body"][100]
wordnet_lemmatizer = WordNetLemmatizer()
# The steps below turn a single raw text (a string)
# into a list of preprocessed words
#
# 1. Remove HTML (the parser is named explicitly so the result
#    does not depend on which parsers are installed)
text = BeautifulSoup(raw_text, "html.parser").get_text()
print(text)
#
# 2. Remove non-letters
letters_only = re.sub("[^a-zA-Z]", " ", text)
print(letters_only)
#
# 3. Convert to lower case, split into individual words
words = letters_only.lower().split()
print(words)
#
# 4. Lemmatize each word (WordNetLemmatizer treats words as nouns by default)
lemmatized_words = [wordnet_lemmatizer.lemmatize(w) for w in words]
print(lemmatized_words)
#
# 5. Searching a set is much faster than searching a list, so convert the stop words to a set
stops = set(stopwords.words("english"))
#
# 6. Remove stop words
meaningful_words = [w for w in lemmatized_words if w not in stops]

print(meaningful_words)


I have followed the guideline of firebase docs to implement login into my app but there is a problem while signup, the app is crashing and the catlog showing the following erros :
Process: app, PID: 12830
    java.lang.IllegalArgumentException: Cannot create PhoneAuthCredential without either verificationProof, sessionInfo, ortemprary proof.
        at com.google.android.gms.common.internal.Preconditions.checkArgument(Unknown Source)
        at com.google.firebase.auth.PhoneAuthCredential.<init>(Unknown Source)
        at com.google.firebase.auth.PhoneAuthProvider.getCredential(Unknown Source)
        at app.MainActivity.verifyPhoneNumberWithCode(MainActivity.java:132)
        at app.MainActivity.onClick(MainActivity.java:110)
        at android.view.View.performClick(View.java:4803)
        at android.view.View$PerformClick.run(View.java:20102)
        at android.os.Handler.handleCallback(Handler.java:810)
        at android.os.Handler.dispatchMessage(Handler.java:99)
        at andro

## Body & Title

In [None]:
# creating function with previously tested steps
def text_to_words(raw_text):
    """Convert a raw HTML text (a single string) to a string of preprocessed words."""
    wordnet_lemmatizer = WordNetLemmatizer()

    # remove HTML, naming the parser explicitly for reproducible results
    text = BeautifulSoup(raw_text, "html.parser").get_text()

    # remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", text)

    # convert to lower case and split into individual words
    words = letters_only.lower().split()

    # lemmatize each word
    lemmatized_words = [wordnet_lemmatizer.lemmatize(w) for w in words]

    # remove stop words (a set makes the membership test fast)
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in lemmatized_words if w not in stops]

    return " ".join(meaningful_words)

In [None]:
# testing function
clean_text = text_to_words( data["Body"][0] )
print(clean_text)

trying code two plot one plot underneath however code keep aligning two plot next one another code import numpy np scipy integrate import odeint numpy import sin co pi array import matplotlib matplotlib import rcparams import matplotlib pyplot plt pylab import figure ax title show import xlsxwriter plt style use ggplot def deriv z l unextended length spring mass bob kg k spring constant nm g gravitational acceleration x dxdt dydt z dx dt l x dydt k x g co dy dt g sin dxdt dydt l x equation motion return np array dxdt dydt dx dt dy dt init array pi initial condition x xdot ydot time np linspace time interval start end number interval sol odeint deriv init time solving equation motion x sol sol fig ax ax plt subplots sharex true ax plot time x ax set ylabel hi ax plot time ax set ylabel fds plt plot keep getting result tried plt subplot x plt subplot plt show run error traceback recent call last file user cnoxon desktop python final code copy py line module plt subplot x file library fra
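
The cleaned body above still contains many tokens that come from the pasted code (`import numpy np scipy ...`). If those tokens turn out to be noise for tag prediction, one option is to drop `<pre>` blocks before cleaning. The sketch below is an optional variant, not the method used in this notebook. `ProseExtractor` and `strip_code_blocks` are hypothetical helpers built on the standard-library `html.parser`, and the sample string is invented.

```python
from html.parser import HTMLParser


class ProseExtractor(HTMLParser):
    """Collect the text found outside <pre> blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._pre_depth = 0  # greater than 0 while inside a <pre> element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a <pre> block
        if self._pre_depth == 0:
            self._chunks.append(data)

    def text(self):
        # join the chunks and collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def strip_code_blocks(html):
    """Return the prose of an HTML fragment, without its <pre> blocks."""
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<p>The app is crashing:</p>\n\n"
          "<pre><code>java.lang.IllegalArgumentException</code></pre>\n"
          "<p>Any idea?</p>")
print(strip_code_blocks(sample))
```

Inline `<code>` spans outside `<pre>` are kept, since they often name the library or function the question is about.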

In [None]:
# Apply the cleaning function to every question body
# and store the result in a new column
data["Body_clean"] = data["Body"].apply(text_to_words)

In [None]:
# Apply the cleaning function to every question title
# and store the result in a new column
data["Title_clean"] = data["Title"].apply(text_to_words)


## Tags

In [None]:
# same steps as for title and body, without HTML removal and lemmatization

def tags_to_words(raw_text):
    """Convert a raw tag string such as '<python><macos>' to a string of words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)

    words = letters_only.lower().split()

    stops = set(stopwords.words("english"))

    meaningful_words = [w for w in words if w not in stops]

    return " ".join(meaningful_words)
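
The letters-only regex changes some tags: `<c++>` and `<g++>` become `c` and `g`, and `<github-pages>` is split into two words. A minimal sketch of an alternative that keeps each tag as written is shown here. `split_tags` and `TAG_PATTERN` are hypothetical names, and the sketch assumes tags are always wrapped in `<...>` as in the raw `Tags` column.

```python
import re

# Matches the content of each <...> pair in a raw Tags string
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def split_tags(raw_tags):
    """Return the list of tags in a string such as '<c++><macos>' (hypothetical helper)."""
    return TAG_PATTERN.findall(raw_tags.lower())


print(split_tags("<c++><macos><g++><scons>"))
print(split_tags("<html><markdown><jekyll><github-pages>"))
```

Keeping tags as list items rather than a space-joined string also makes them easier to feed to a multi-label encoder later on.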

In [None]:
# Apply the tag cleaning function to every question
# and store the result in a new column
data["Tags_clean"] = data["Tags"].apply(tags_to_words)

In [None]:
data.head(5)

| | Title | Body | Tags | CreationDate | Body_clean | Title_clean | Tags_clean |
|---|---|---|---|---|---|---|---|
| 0 | How to rearrange subplots so that one is under... | `<p>I am trying to code two plots such that one...` | `<python><matplotlib><subplot>` | 2019-01-01 00:05:48 | trying code two plot one plot underneath howev... | rearrange subplots one underneath | python matplotlib subplot |
| 1 | perl6 How to use junction inside regex interpo... | `<p>Sometimes I have a long list and I would li...` | `<regex><interpolation><raku><junction>` | 2019-01-01 00:12:59 | sometimes long list would like check whether s... | perl use junction inside regex interpolation | regex interpolation raku junction |
| 2 | How to set size/rotate image in jekyll? | `<p>How to set size of image in jekyll markdown...` | `<html><markdown><jekyll><github-pages>` | 2019-01-01 00:26:29 | set size image jekyll markdown steam fish asse... | set size rotate image jekyll | html markdown jekyll github pages |
| 3 | Scons appending a random '1' to macro definiti... | `<p>I have a command line argument that defines...` | `<c++><macos><g++><scons>` | 2019-01-01 00:37:31 | command line argument defines type use vector ... | scons appending random macro definition osx | c macos g scons |
| 4 | pyenv failed to download a existing version of... | `<p>I recently installed pyenv and attempted to...` | `<python><macos><pyenv>` | 2019-01-01 00:44:43 | recently installed pyenv attempted install ver... | pyenv failed download existing version python | python macos pyenv |


In [None]:
# joining title and body with a space so that the last title word
# and the first body word are not glued together
data["full"] = data["Title_clean"] + " " + data["Body_clean"]
data["full"][0]

'rearrange subplots one underneath trying code two plot one plot underneath however code keep aligning two plot next one another code import numpy np scipy integrate import odeint numpy import sin co pi array import matplotlib matplotlib import rcparams import matplotlib pyplot plt pylab import figure ax title show import xlsxwriter plt style use ggplot def deriv z l unextended length spring mass bob kg k spring constant nm g gravitational acceleration x dxdt dydt z dx dt l x dydt k x g co dy dt g sin dxdt dydt l x equation motion return np array dxdt dydt dx dt dy dt init array pi initial condition x xdot ydot time np linspace time interval start end number interval sol odeint deriv init time solving equation motion x sol sol fig ax ax plt subplots sharex true ax plot time x ax set ylabel hi ax plot time ax set ylabel fds plt plot keep getting result tried plt subplot x plt subplot plt show run error traceback recent call last file user cnoxon desktop python final code copy py line mod

In [None]:
# saving data for exploration
data_clean = data
data_clean.to_csv( "/content/drive/MyDrive/DATASETS/ML/P5/data_clean.csv",index=False)
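
One thing to keep in mind when this file is reloaded: a cleaned field can end up as an empty string (for example a title made only of stop words or digits), and `pd.read_csv` reads empty fields back as `NaN` by default. A small sketch of the round trip follows. The `toy` frame is invented, and an in-memory buffer stands in for the CSV file.

```python
import io

import pandas as pd

# Toy frame (invented rows); the second title is empty after cleaning
toy = pd.DataFrame({
    "Title_clean": ["set size rotate image jekyll", ""],
    "Tags_clean": ["html markdown jekyll github pages", "python"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default reload: the empty title comes back as NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["Title_clean"].isna().sum())

# Option: keep empty fields as empty strings on reload
buffer.seek(0)
reloaded_str = pd.read_csv(buffer, keep_default_na=False)
print(repr(reloaded_str["Title_clean"][1]))
```

Calling `.fillna("")` on the text columns after a default reload is an equivalent fix when only a few columns are affected.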