# Kernel Start

<div id="start"> 

</div>

# Overview 

The objective of this notebook is to investigate the nature of the provided datasets and provide a simple dedicated EDA for each of the different datasets.

Each subsection will start with a summary of all the findings from the EDA for the particular dataset followed by all the visualisations and computations that have led to those observations.

The hyperlinks to the EDAs for the different datasets (13 different datasets) are as follows:

* [Professionals](#Prof)
* [School Memberships](#Sch)
* [Matches](#Mtc)
* [Tag Questions](#TagQ)
* [Tags](#Tag)
* [Answers](#Ans)
* [Emails](#Ems)
* [Students](#Stud)
* [Questions](#Ques)
* [Groups](#Grou)
* [Group memberships](#Groum)
* [Comments](#comms)
* [Tagged User](#Tagu)

Any comments, recommendations, upvotes or suggestions are much appreciated!

*Credits*: The main python code for the EDA of the different datasets has been adapted from my [kernel on the Quora Insincere Questions Classification Challenge](https://www.kaggle.com/spurryag/beginner-attempt-at-nlp-workflow), which has been in turn inspired by various kernels for that kaggle competition. The linked kernel contains the original references to the authors of the adapted code for the Quora Insincere Questions Classification challenge.

Other kernels that have been leveraged in this notebook are:

* https://www.kaggle.com/anu0012/quick-start-eda-careervillage-org

**Hyperlink to end of kernel:**

[End of Kernel](#End)


# Methodology

Given the NLP nature of the datasets, similar exploration methods will be applied to each dataset to unveil their distinct characteristics.  This will include using visualisations such as:

1) **Frequency Tables**

A frequency table is a method of organizing raw data in a compact form by displaying a series of scores in ascending or descending order, together with their frequencies—the number of times each score occurs in the respective data set.

2) **Barplots**

A Barplot indicates the relationship between a categorical variable and a numerical variable. 

3) **Word Clouds**

Word clouds can identify trends and patterns that would otherwise be unclear or difficult to see in a tabular format. Frequently used keywords stand out better in a word cloud. Common words that might be overlooked in tabular form are highlighted in larger text making them pop out when displayed in a word cloud.

4) **N-Gram (Unigram, Bigram and Trigram)**

An n-gram is a contiguous sequence of n items from a given sample of text or speech. Counting unigrams, bigrams and trigrams allows the most prevalent words and phrases in each text field to be identified, and thus helps reveal the dominant themes of the questions, answers and comments.

It should be noted that prior to displaying individual words or sentences, the text will first be tokenized (based on a desired integer) and then put into a dataframe which will be used to construct side by side plots. Tokenization is, generally, an early step in the NLP process, a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc.

5) **Box plots for Word count distribution and Stop words.**

A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It indicates how the outlying values in a dataset are distributed and what their values are. It can also be used to examine the distribution of the data and its skewness.
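
As a rough illustration of the five number summary behind a boxplot, the sketch below computes the quartiles and the whisker limits for a small set of made-up word counts. The values are hypothetical, and the 1.5 × IQR rule for the whiskers is the common convention rather than something specific to this notebook.

```python
import pandas as pd

# Hypothetical word counts for a handful of texts (illustrative values only)
word_counts = pd.Series([3, 5, 6, 7, 8, 9, 12, 40])

# Quartiles that define the box
q1, median, q3 = word_counts.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whisker limits under the common 1.5 * IQR convention
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers
outliers = word_counts[(word_counts < lower_fence) | (word_counts > upper_fence)]
print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers.tolist())
```

A single very long text (40 words here) falls above the upper fence, which is the pattern that produces the long right tail seen in the word count box plots later on.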


In [None]:
#Import the different libraries 
import os
print(os.listdir("../input")) #display the available files for analysis
import pandas as pd
from pandas import DataFrame
import numpy as np
import seaborn as sns
from collections import defaultdict
import re
from bs4 import BeautifulSoup


In [None]:
#Code for wordcloud (adapted for removal of stop words)

#Code adapted from : https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc

#import the wordcloud package (ImageColorGenerator is needed when image_color=True)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

#Define the word cloud function with a max of 200 words
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(10,10), 
                   title = None, title_size=20, image_color=False):
    stopwords = set(STOPWORDS)
    #define additional stop words that are not contained in the dictionary
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)
    #Generate the word cloud
    wordcloud = WordCloud(background_color='black',
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=800, 
                    height=400,
                    mask = mask)
    wordcloud.generate(str(text))
    #set the plot parameters
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  

In [None]:
#ngram function
def ngram_extractor(text, n_gram):
    token = [token for token in text.lower().split(" ") if token != "" if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]

# Function to generate a dataframe with n_gram and top max_row frequencies
def generate_ngrams(df, n_gram, max_row):
    temp_dict = defaultdict(int)
    for question in df:
        for word in ngram_extractor(question, n_gram):
            temp_dict[word] += 1
    temp_df = pd.DataFrame(sorted(temp_dict.items(), key=lambda x: x[1])[::-1]).head(max_row)
    temp_df.columns = ["word", "wordcount"]
    return temp_df

#Function to construct side by side comparison plots
def comparison_plot(df_1,df_2,col_1,col_2, space):
    fig, ax = plt.subplots(1, 2, figsize=(20,10))
    
    sns.barplot(x=col_2, y=col_1, data=df_1, ax=ax[0], color="royalblue")
    sns.barplot(x=col_2, y=col_1, data=df_2, ax=ax[1], color="royalblue")

    ax[0].set_xlabel('Word count', size=14)
    ax[0].set_ylabel('Words', size=14)
    ax[0].set_title('Top words in sincere questions', size=18)

    ax[1].set_xlabel('Word count', size=14)
    ax[1].set_ylabel('Words', size=14)
    ax[1].set_title('Top words in insincere questions', size=18)

    fig.subplots_adjust(wspace=space)
    
    plt.show()
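
To make the n-gram step concrete, here is a minimal, self-contained sketch of the same idea on two made-up sentences. The function name `toy_ngrams` and the small stop word set are hypothetical stand-ins so the example runs without the `wordcloud` package; the notebook itself uses `ngram_extractor` with `STOPWORDS`.

```python
from collections import Counter

# Tiny illustrative stop word list (the notebook uses wordcloud's STOPWORDS instead)
TOY_STOPWORDS = {"i", "a", "to", "the", "in", "what", "is", "for"}

def toy_ngrams(text, n):
    # Lower-case, split on spaces, drop empty tokens and stop words
    tokens = [t for t in text.lower().split(" ") if t != "" and t not in TOY_STOPWORDS]
    # Zip n shifted copies of the token list to form consecutive n-token windows
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

texts = [
    "What is the best career path in software engineering",
    "I want a career path in data science",
]

# Count every bigram across all texts
counts = Counter(gram for text in texts for gram in toy_ngrams(text, 2))
print(counts.most_common(3))
```

Because stop words are removed before the windows are formed, a bigram such as "path software" can join words that were not adjacent in the original sentence. This is worth keeping in mind when reading the n-gram bar plots.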

## *Professionals Dataset*

<div id="Prof"> 

</div>

[Go back to start of kernel](#start)

 ### **Professionals Dataset Summary**
 
It is noted based on the below that there are 28,152 entries in the dataset, with missing values in several fields (roughly 7% to 11% per affected field, as tabulated in the section below). The oldest professional joined in 2011 while the most recent joiner dates to 2019. The top 5 headlines are Solutions Consultant, Assurance Associate, General Manager, Software Engineer and Project Manager. It is also observed that a number of professionals in the dataset work at PWC. Additionally, the top 5 industries in which the professionals work are Telecommunications, IT, Healthcare, Education and Finance. Finally, the majority of the professionals operate out of the US and India, based on the top 20 locations obtained.

In [None]:
#Import and view information of the Professionals Dataset
prof = pd.read_csv('../input/professionals.csv')
prof.info()

In [None]:
#Print the start and end dates
print('Oldest Professional join date:',prof.professionals_date_joined.min(), '\n' +'Most Recent Professional join date:',prof.professionals_date_joined.max()) 
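
The minimum and maximum above are taken on the raw date strings, which works as long as the strings sort chronologically. A hedged alternative is to parse the column first, which also makes it easy to count joiners per year. The sketch below uses made-up timestamps; the exact string format of the real column is an assumption and may need a `format` argument.

```python
import pandas as pd

# Hypothetical join dates (illustrative values, not taken from the dataset)
toy_dates = pd.Series([
    "2011-10-05 20:35:19",
    "2019-01-31 23:39:40",
    "2015-06-17 08:00:00",
])

# Parse the strings into timestamps so comparisons are truly chronological
parsed = pd.to_datetime(toy_dates)
print("Oldest year:", parsed.min().year, "Most recent year:", parsed.max().year)

# Number of joiners per calendar year, in chronological order
per_year = parsed.dt.year.value_counts().sort_index()
print(per_year)
```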

### **Frequency Count of missing values in Professionals Dataset**

In [None]:
#Tabulate the count of missing values in the dataset
prof.isnull().sum()

#The location (~11% missing), industry (~9% missing) and headline (~7% missing) columns have missing values
#The missing value % has been calculated as [missing value count / 28152]
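
The percentages quoted above can also be computed directly rather than by hand. A minimal sketch on a made-up frame (the column names and values are hypothetical) is shown below; the same expression could be applied to `prof`.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with some missing entries
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "location": ["Boston", np.nan, "Mumbai", np.nan],
    "industry": ["IT", "Finance", np.nan, "Education"],
})

# isnull() gives booleans, so the column mean is the fraction of missing values
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```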

### **Wordclouds for Headlines of Professionals Dataset**

In [None]:
#Select headlines from professionals dataset, treating the '--' placeholder as missing
prof_headlines = prof["professionals_headline"].replace('--', np.nan)
#Remove NA values for clarity of visualisation
prof_headlines_na = prof_headlines.dropna()
#Run the word cloud function on the professional headlines
plot_wordcloud(prof_headlines_na, title="Word Cloud of Professionals Headlines")

### **Barplots for unique counts of headlines from Professionals Dataset**


In [None]:
#Define a barplot for each
#Below code adapted from: https://www.kaggle.com/anu0012/quick-start-eda-careervillage-org

headlines = prof_headlines_na.value_counts().head(20)
plt.figure(figsize=(12,8))
sns.barplot(x=headlines.values, y=headlines.index)
plt.xlabel("Count", fontsize=15)
plt.ylabel("Unique Headlines", fontsize=15)
plt.title("Top 20 Unique Professionals Headlines")
plt.show()

### **Barplots for unique counts of Industries from Professionals Dataset**


In [None]:
prof_industry_na = prof["professionals_industry"].dropna()       
industries = prof_industry_na.value_counts().head(20)
plt.figure(figsize=(12,8))
sns.barplot(x=industries.values, y=industries.index)
plt.xlabel("Count", fontsize=15)
plt.ylabel("Unique Industries", fontsize=15)
plt.title("Top 20 Unique Professionals Industries")
plt.show()

### **Frequency counts of locations for Professionals from Professionals Dataset**


In [None]:
prof_loc_na = prof["professionals_location"].dropna()       
top_loc = prof_loc_na.value_counts().head(20)
print(top_loc)

## *School Memberships Dataset*

<div id="Sch"> 

</div>

[Go back to start of kernel](#start)

 ### **School Memberships Dataset summary**
It is noted based on the below that there are 5,638 entries in the dataset, with no missing values.  The top 5 most frequent school ids are 196700, 200003, 196665, 196883 and 200261.

In [None]:
#Import and view information of the School membership Dataset
school = pd.read_csv('../input/school_memberships.csv')
school.info()
#Check for missing values
school.isnull().sum()

### **Frequency counts of school ids from School memberships dataset**


In [None]:
#Identify the top school IDs
school_id_na = school["school_memberships_school_id"] 
top_school_ids = school_id_na.value_counts().head(20)
print(top_school_ids)

## *Matches Dataset*

<div id="Mtc"> 

</div>

[Go back to start of kernel](#start)

 ### **Matches Dataset summary**
It is noted based on the below that there are 4,316,275  entries in the dataset, with no missing values.  The top 5 email ids which contained the most questions are 569938, 569892, 569829, 569941 and 508675.

In [None]:
#Import and view information of the Matches Dataset
match = pd.read_csv('../input/matches.csv')
match.info()
#Check for missing values
match.isnull().sum()

### **Frequency counts of email ids from Matches Dataset**


In [None]:
#Identify which email ids contained the most questions
top_match_emails = match["matches_email_id"] .value_counts().head(5)
print(top_match_emails)

## *Tag questions dataset*

<div id="TagQ"> 

</div>

[Go back to start of kernel](#start)

 ### **Tag Questions Dataset summary**
It is noted based on the below that there are 76,552 entries in the dataset, with no missing values. The 5 most frequently occurring tag ids across the tag-to-question pairings are 27490, 129, 89, 54 and 27292.


In [None]:
#Import and view information of the Tag Questions Dataset
tag_ques = pd.read_csv('../input/tag_questions.csv')
tag_ques.info()
#Check for missing values
tag_ques.isnull().sum()

###  **Frequency counts of tag ids from Tag question Dataset**

In [None]:
#Identify which tag ids were the most used
tag_ques_pair = tag_ques["tag_questions_tag_id"] .value_counts().head(5)
print(tag_ques_pair)

## *Tag dataset*

<div id="Tag"> 

</div>

[Go back to start of kernel](#start)

 ### **Tag Dataset summary**
It is noted based on the below that there are 16,269 entries in the dataset, with 1 missing value in the tag name column. No tag name occurs more than once, implying that all tags are unique.
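
The uniqueness claim can be checked explicitly instead of being read off the top of a frequency table. The sketch below uses a made-up series of tag names; on the real data the same calls could be applied to the tag name column.

```python
import pandas as pd

# Hypothetical tag names, including one missing value
toy_tags = pd.Series(["college", "career", "engineering", None])

# Drop the missing entry before testing for repeats
names = toy_tags.dropna()

print("All tag names unique:", names.is_unique)
print("Highest frequency of any tag name:", names.value_counts().max())
print("Number of repeated entries:", names.duplicated().sum())
```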

In [None]:
#Import and view information of the Tags Dataset
tags = pd.read_csv('../input/tags.csv')
tags.info()
#Check for missing values
tags.isnull().sum()

###  **Frequency counts of most used tags from tags dataset**

In [None]:
#Identify which tags were the most used
tags_names = tags["tags_tag_name"] .value_counts().head(5)
print(tags_names)

## *Answers dataset*

<div id="Ans"> 

</div>

[Go back to start of kernel](#start)

**Answers Dataset summary**

It is noted based on the below that there are 51,123 entries in the dataset, with 1 missing value in the answers_body column. It is also seen that the oldest answer dates back to 2011 and the most recent answer dates to 2019. The maximal and minimal number of answers in the dataset by contributor are respectively 1710 and 1, as shown in the frequency table. The word cloud indicates that some major themes of interest relate to success, work, responses, college and computing. The n-gram analysis outlines that career path, personality, career and university are the main themes of the answers. Finally, the distribution of the stop words and word count appears to be heavily right skewed.
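
Right skew is read off the box plots in this notebook, but it can also be quantified. As a hedged sketch on made-up word counts (not values from the dataset), a mean well above the median and a positive sample skewness both point to a long right tail.

```python
import pandas as pd

# Hypothetical word counts: mostly short texts plus one very long one
toy_counts = pd.Series([20, 25, 30, 35, 40, 45, 50, 400])

mean_value = toy_counts.mean()
median_value = toy_counts.median()
skewness = toy_counts.skew()  # sample skewness; positive means a right tail

print("Mean:", mean_value, "Median:", median_value)
print("Right skewed:", mean_value > median_value and skewness > 0)
```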

In [None]:
#Import and view information of the Answers Dataset
ans = pd.read_csv('../input/answers.csv')
ans.info()
#Check for missing values
ans.isnull().sum()

In [None]:
#Print the oldest and most recent start date
print('Oldest answer date:',ans.answers_date_added.min(), '\n' +'Most recent answer date:',ans.answers_date_added.max()) 

In [None]:
ans_author_id = ans["answers_author_id"] .value_counts().tail(5) #switch to .head for the top n responses
print(ans_author_id)

###  **Bar plot of unique author ids from Answers Dataset**

In [None]:
#Obtain the unique counts of author ids 
ans_author = ans["answers_author_id"].value_counts().head(20)
#print(ans_author.tail(5))
plt.figure(figsize=(12,8))
sns.barplot(x=ans_author.values, y=ans_author.index)
plt.xlabel("Count", fontsize=15)
plt.ylabel("Unique author_id", fontsize=15)
plt.title("Top 20 Unique author_id")
plt.show()

###  **Word cloud of answers_body from Answers Dataset**

In [None]:
#Select the answer bodies from the answers dataset
ans_body = ans["answers_body"]
#Remove NA values for clarity of visualisation
ans_body_na = ans_body.dropna()
#Run the word cloud function on the answer bodies
plot_wordcloud(ans_body_na, title="Word Cloud for Answer body")

###  **n-gram analysis of answers_body from Answers Dataset**

Prior to computing the n-gram analysis, the data will first be preprocessed to remove any html tags, upper case characters, urls and special characters.
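
The preprocessing steps described above can be sketched on a single made-up HTML string. To keep the example free of third-party dependencies it uses the standard library `html.parser` in place of BeautifulSoup, and the names `ParagraphText` and `clean_text` are hypothetical; the final whitespace collapse is an extra step not present in the notebook cells.

```python
import re
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <p> tags are currently open
        self.parts = []  # text fragments found inside <p> tags

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

def clean_text(raw_html):
    parser = ParagraphText()
    parser.feed(raw_html)
    parser.close()
    # Join paragraph texts and convert to lower case
    text = " ".join(parser.parts).lower()
    # Remove urls, then the same special characters as in the notebook
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)
    # Collapse repeated whitespace left behind by the removals (extra step)
    return " ".join(text.split())

sample = "<p>Hello, World! Visit https://example.com/page for more.</p><p>Good luck!</p>"
print(clean_text(sample))
```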

In [None]:
#Define empty list
ans_bod_cleaned = []
res = []
#Define for loop to iterate through the elements of the answer_body
for l in ans_body_na:
    #Parse the contents of the cell
    soup = BeautifulSoup(l, 'html.parser')
    #Find all instances of the text within the </p> tag
    for el in soup.find_all('p'):
        res.append(el.get_text())
    #concatenate the strings from the list    
    endstring = ' '.join(map(str, res))
    #reset list
    res = []
    #Append the concatenated string to the main list
    ans_bod_cleaned.append(endstring)

In [None]:
#convert list elements to lower case
ans_body_na_cleaned = [item.lower() for item in ans_bod_cleaned]
#remove html links from list 
ans_body_na_cleaned =  [re.sub(r"http\S+", "", item) for item in ans_body_na_cleaned]
#remove special characters left
ans_body_na_cleaned = [re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", item) for item in ans_body_na_cleaned]
#convert to dataframe and rename the column of the ans_body_na_cleaned list
ans_body_na_clean = pd.DataFrame(np.array(ans_body_na_cleaned).reshape(-1))
ans_body_na_clean.columns = ["ans"]
#Squeeze dataframe to obtain series
answers_cleaned = ans_body_na_clean.squeeze()

In [None]:
#generate unigram
ans_unigram = generate_ngrams(answers_cleaned, 1, 20)

In [None]:
#generate barplot for unigram
plt.figure(figsize=(12,8))
sns.barplot(x=ans_unigram["wordcount"], y=ans_unigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Unigrams", fontsize=15)
plt.title("Top 20 Unigrams for Answer body")
plt.show()

In [None]:
#generate bigram
ans_bigram = generate_ngrams(answers_cleaned, 2, 20)

In [None]:
#generate barplot for bigram
plt.figure(figsize=(12,8))
sns.barplot(x=ans_bigram["wordcount"], y=ans_bigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Bigrams", fontsize=15)
plt.title("Top 20 Bigrams for Answer body")
plt.show()

In [None]:
#generate trigram
ans_trigram = generate_ngrams(answers_cleaned, 3, 20)

In [None]:
#generate barplot for trigram
plt.figure(figsize=(12,8))
sns.barplot(x=ans_trigram["wordcount"], y=ans_trigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Trigrams", fontsize=15)
plt.title("Top 20 Trigrams for Answer body")
plt.show()

###  **Box plots for Word count distribution and Stop words for answer_body from Answers Dataset**

In [None]:
# Number of words in the answers (stored in a separate DataFrame because answers_cleaned is a Series)
answers_word_count = pd.DataFrame({"word_count": answers_cleaned.apply(lambda x: len(str(x).split()))})

fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="word_count", data=answers_word_count, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Answer body', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Word Count distribution for Answer body', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()

In [None]:
# Number of stopwords in answers (computed from the cleaned answer strings)
answers_stop_words = pd.DataFrame({"stop_words_count": [len([w for w in str(x).lower().split() if w in STOPWORDS]) for x in ans_body_na_cleaned]})
fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="stop_words_count", data=answers_stop_words, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Stop Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Answer body', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Number of Stop Words distribution for Answer body', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()

## *Emails dataset*

<div id="Ems"> 

</div>

[Go back to start of kernel](#start)

**Emails Dataset summary**

It is noted based on the below that there are 1,850,101 entries in the dataset with no missing values. It is noted that the oldest email dates back to 2013 and the most recent email dates to 2019. The different frequencies at which the emails were received are daily, immediate and weekly. Additionally, the most recurring recipient id is 0079e89bf1544926b98310e81315b9f1, which appears 3,496 times in the dataset.

In [None]:
#Import and view information of the Emails Dataset
emails = pd.read_csv('../input/emails.csv')
emails.info()
#Check for missing values
emails.isnull().sum()

In [None]:
#Print the oldest and most recent start date
print('Oldest email date:',emails.emails_date_sent.min(), '\n' +'Most recent email date:',emails.emails_date_sent.max()) 

In [None]:
#The different frequencies at which the emails were received
emails["emails_frequency_level"].unique()    

In [None]:
#Identify which recipients were the most solicited
recipient_emails = emails["emails_recipient_id"] .value_counts().head(5)
print(recipient_emails)

## *Students dataset*

<div id="Stud"> 

</div>

[Go back to start of kernel](#start)

**Students Dataset summary**

It is noted based on the below that there are 30,971 entries in the dataset with 2,033 missing values in the students_location column. It is noted that the oldest student registration dates back to 2011 and the most recent registration dates to 2019. Most of the students are spread across US and Indian locations.

In [None]:
#Import and view information of the Students Dataset
students = pd.read_csv('../input/students.csv')
students.info()
#Check for missing values
students.isnull().sum()

In [None]:
#Identify where most students are from
students_loc = students["students_location"].dropna()
print(students_loc.value_counts().head(5))

In [None]:
#Print the oldest and most recent start date
print('Oldest student registration date:',students.students_date_joined.min(), '\n' +'Most recent student registration date:',students.students_date_joined.max()) 

## *Questions dataset*

<div id="Ques"> 

</div>

[Go back to start of kernel](#start)

**Questions Dataset summary**

It is noted based on the below that there are 23,931 entries in the dataset, with no missing values. It is also seen that the oldest question dates back to 2011 and the most recent question dates to 2019. The maximal and minimal number of questions in the dataset by an author are respectively 93 and 1, as shown in the frequency table (commented out). In line with the answers dataset, the word cloud indicates that some major themes of interest relate to success, work, responses, college and technology. This is in line with the n-gram analysis which outlines that Human Resources, recruitment, personality, career and university are the main themes of the questions. Finally, the distribution of the stop words and word count appears to be heavily right skewed, similar to the answers dataset.

In [None]:
#Import and view information of the Questions Dataset
ques = pd.read_csv('../input/questions.csv')
ques.info()
#Check for missing values
ques.isnull().sum()

In [None]:
#Print the oldest and most recent start date
print('Oldest question date:',ques.questions_date_added.min(), '\n' +'Most recent question date:',ques.questions_date_added.max()) 

In [None]:
#Identify who asked the most questions
ques_auth = ques["questions_author_id"].dropna()
print(ques_auth.value_counts().head(5))

###  **Word cloud of questions_title from Questions Dataset**

In [None]:
#Select the question titles from the questions dataset
ques_title = ques["questions_title"]
#Remove NA values for clarity of visualisation
ques_title_na = ques_title.dropna()
#Run the word cloud function on the question titles
plot_wordcloud(ques_title_na, title="Word Cloud for Question title")

###  **n-gram analysis of question_title from Questions Dataset**

Prior to computing the n-gram analysis, the data will first be preprocessed to remove any upper case and special characters.

In [None]:
#convert list elements to lower case
quest_title_na_cleaned = [item.lower() for item in ques_title_na]
#remove html links from list 
quest_title_na_cleaned =  [re.sub(r"http\S+", "", item) for item in quest_title_na_cleaned]
#remove special characters left
quest_title_na_cleaned = [re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", item) for item in quest_title_na_cleaned]

In [None]:
#generate unigram
ques_title_unigram = generate_ngrams(quest_title_na_cleaned, 1, 20)

In [None]:
#generate barplot for unigram
plt.figure(figsize=(12,8))
sns.barplot(x=ques_title_unigram["wordcount"], y=ques_title_unigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Unigrams", fontsize=15)
plt.title("Top 20 Unigrams for Question title")
plt.show()

In [None]:
#generate bigram
ques_title_bigram = generate_ngrams(quest_title_na_cleaned, 2, 20)

In [None]:
#generate barplot for bigram
plt.figure(figsize=(12,8))
sns.barplot(x=ques_title_bigram["wordcount"], y=ques_title_bigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Bigrams", fontsize=15)
plt.title("Top 20 Bigrams for Question title")
plt.show()

In [None]:
#generate trigram
ques_title_trigram = generate_ngrams(quest_title_na_cleaned, 3, 20)

In [None]:
#generate barplot for trigram
plt.figure(figsize=(12,8))
sns.barplot(x=ques_title_trigram["wordcount"], y=ques_title_trigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Trigrams", fontsize=15)
plt.title("Top 20 Trigrams for Question title")
plt.show()

###  **Box plots for Word count distribution and Stop words of question_title from Questions dataset**

In [None]:
#convert the quest_title_na_cleaned list to a dataframe, rename its column and squeeze it to a series
ques_title_na_clean = pd.DataFrame(np.array(quest_title_na_cleaned).reshape(-1))
ques_title_na_clean.columns = ["ques_title"]
ques_title_na_clean = ques_title_na_clean.squeeze()

# Number of words in the question_title (stored in a separate DataFrame because ques_title_na_clean is a Series)
ques_title_word_count = pd.DataFrame({"word_count": ques_title_na_clean.apply(lambda x: len(str(x).split()))})
fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="word_count", data=ques_title_word_count, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Question title text', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Word Count distribution for Question title', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()

In [None]:
# Number of stopwords in question_title (computed from the cleaned title strings, not from the answers data)
ques_title_stop_words = pd.DataFrame({"stop_words_count": [len([w for w in str(x).lower().split() if w in STOPWORDS]) for x in quest_title_na_cleaned]})
fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="stop_words_count", data=ques_title_stop_words, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Stop Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Question title text', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Number of Stop Words distribution for Question title', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()

###  **Word cloud of questions_body from Questions Dataset**

In [None]:
#Select the question bodies from the questions dataset
ques_bod = ques["questions_body"]
#Remove NA values for clarity of visualisation
ques_bod_na = ques_bod.dropna()
#Run the word cloud function on the question bodies
plot_wordcloud(ques_bod_na, title="Word Cloud for Question Body")

###  **n-gram analysis of question_body from questions dataset**

Prior to computing the n-gram analysis, the data will first be preprocessed to remove any upper case and special characters.

In [None]:
#convert list elements to lower case
quest_bod_na_cleaned = [item.lower() for item in ques_bod_na]
#remove special characters left
quest_bod_na_cleaned = [re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", item) for item in quest_bod_na_cleaned]

In [None]:
#generate unigram
ques_bod_unigram = generate_ngrams(quest_bod_na_cleaned, 1, 20)

In [None]:
#generate barplot for unigram for question_body
plt.figure(figsize=(12,8))
sns.barplot(x=ques_bod_unigram["wordcount"], y=ques_bod_unigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Unigrams", fontsize=15)
plt.title("Top 20 Unigrams for Question Body")
plt.show()

In [None]:
#generate bigram
ques_bod_bigram = generate_ngrams(quest_bod_na_cleaned, 2, 20)

In [None]:
#generate barplot for bigram question_body
plt.figure(figsize=(12,8))
sns.barplot(x=ques_bod_bigram["wordcount"], y=ques_bod_bigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Bigrams", fontsize=15)
plt.title("Top 20 Bigrams for Question Body")
plt.show()

In [None]:
#generate trigram
ques_bod_trigram = generate_ngrams(quest_bod_na_cleaned, 3, 20)

In [None]:
#generate barplot for trigram question_body
plt.figure(figsize=(12,8))
sns.barplot(x=ques_bod_trigram["wordcount"], y=ques_bod_trigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Trigrams", fontsize=15)
plt.title("Top 20 Trigrams for Question Body")
plt.show()

###  **Box plots for Word count distribution and Stop words of question_body from questions dataset**

In [None]:
#convert the quest_bod_na_cleaned list to a dataframe, rename its column and squeeze it to a series
ques_bod_na_clean = pd.DataFrame(np.array(quest_bod_na_cleaned).reshape(-1))
ques_bod_na_clean.columns = ["ques_body"]
ques_bod_na_clean = ques_bod_na_clean.squeeze()

# Number of words in the question_body (stored in a separate DataFrame because ques_bod_na_clean is a Series)
ques_bod_word_count = pd.DataFrame({"word_count": ques_bod_na_clean.apply(lambda x: len(str(x).split()))})
fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="word_count", data=ques_bod_word_count, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Question Body text', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Word Count distribution for Question Body', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()

In [None]:
# Number of stopwords in question_body (computed from the cleaned body strings)
ques_bod_stop_words = pd.DataFrame({"stop_words_count": [len([w for w in str(x).lower().split() if w in STOPWORDS]) for x in quest_bod_na_cleaned]})
fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="stop_words_count", data=ques_bod_stop_words, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Stop Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Question Body text', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Number of Stop Words distribution for Question Body', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()

## *Groups dataset*

<div id="Grou"> 

</div>

[Go back to start of kernel](#start)

**Groups Dataset summary**

It is noted based on the below that there are 49 entries in the dataset with no missing values. It is also seen that the youth program is the most popular group type with 33 mentions as compared to competition (1 mention) which is the least popular.

In [None]:
#Import and view information of the Groups Dataset
groups = pd.read_csv('../input/groups.csv')
groups.info()
#Check for missing values
groups.isnull().sum()

In [None]:
#Identify the least common group types
group_type = groups["groups_group_type"].dropna()
print(group_type.value_counts().tail(5)) #switch to .head for the top results

## *Group memberships dataset*

<div id="Groum"> 

</div>

[Go back to start of kernel](#start)

**Group Memberships Dataset summary**

It is noted based on the below that there are 1,038 entries in the dataset with no missing values. It is also seen that the most recurring group_memberships_user_id has 14 occurrences as compared to 1 mention for the least popular one. Additionally, the most recurring group_memberships_group_id has 117 mentions and the least recurring one has 1 mention.

In [None]:
#Import and view information of the Groups membership Dataset
groups_mem = pd.read_csv('../input/group_memberships.csv')
groups_mem.info()
#Check for missing values
groups_mem.isnull().sum()

In [None]:
#Identify the most popular group_memberships_user_id     
groups_mem_id = groups_mem["group_memberships_user_id"].dropna()
print(groups_mem_id .value_counts().head(5))

In [None]:
#Identify the most popular group_memberships_group_id
groups_mem_g_id = groups_mem["group_memberships_group_id"].dropna()
print(groups_mem_g_id .value_counts().head(5))

## *Comments dataset*

<div id="comms"> 

</div>

[Go back to start of kernel](#start)

**Comments Dataset summary**

It is noted based on the below that there are 14,966 entries in the dataset with 4 missing values in the comments_body column. It is also seen that the oldest comment dates back to 2011 and the most recent one dates to 2019. The word cloud indicates that the themes of the comments revolve around thanking users for their helpful advice. This is supported by the n-gram analysis which outlines that encouragements and thanks are the most prevalent themes in the comments. Similar to the above datasets, the comments_body has a right skewed distribution, albeit wider than the others.

In [None]:
#Import and view information of the comments Dataset
comms = pd.read_csv('../input/comments.csv')
comms.info()
#Check for missing values
comms.isnull().sum()

In [None]:
#Identify the most popular commenters (comments_author_id)
comms_authors = comms["comments_author_id"].dropna()
print(comms_authors .value_counts().head(5))

In [None]:
#Print the oldest and most recent comment dates
print('Oldest comment date:', comms.comments_date_added.min(), '\n' + 'Most recent comment date:', comms.comments_date_added.max())
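
Calling `min()` and `max()` on the raw column compares the values as text, which gives the right answer only when the timestamp format happens to sort chronologically. A hedged sketch of parsing the values first with `pd.to_datetime` is shown below. The sample timestamps are made up, and the real `comments_date_added` format is an assumption here and may need an explicit `format` argument.

```python
import pandas as pd

# Hypothetical timestamps; the real comments_date_added format may differ
toy_dates = pd.Series([
    "2016-05-20 16:48:25",
    "2011-10-05 20:35:19",
    "2019-01-31 23:39:40",
])

# Parse strings into datetimes so min/max compare points in time, not text
parsed = pd.to_datetime(toy_dates, errors="coerce")
oldest, newest = parsed.min(), parsed.max()
print("Oldest comment date:", oldest)
print("Most recent comment date:", newest)
```

With `errors="coerce"`, values that cannot be parsed become `NaT`, so checking `parsed.isna().sum()` afterwards shows whether any rows were lost.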

###  **Word cloud of comments_body from comments dataset**


In [None]:
#select comments text from comments dataset
comms_bod = comms["comments_body"]
comms_bod_na = comms_bod.dropna()
#run the word cloud function on the comments body, with NA values removed for clarity of visualisation
plot_wordcloud(comms_bod_na, title="Word Cloud of Comments body")

###  **n-gram analysis of comments_body from comments dataset**

In [None]:
#convert list elements to lower case
comms_bod_na_cleaned = [item.lower() for item in comms_bod_na]
#remove URLs (anything starting with http) from list
comms_bod_na_cleaned =  [re.sub(r"http\S+", "", item) for item in comms_bod_na_cleaned]
#remove special characters left
comms_bod_na_cleaned = [re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", item) for item in comms_bod_na_cleaned]
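
To make the effect of the three cleaning steps above concrete, the sketch below applies the same regular expressions to a single made-up sentence. The wrapper `clean_text` and the sample string are hypothetical and used for illustration only.

```python
import re

def clean_text(text):
    """Lower-case a string, drop URLs, then strip the listed punctuation."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)
    return text

sample = "Thank you! See https://example.com/page for more-info."
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that removing the hyphen joins `more-info` into `moreinfo`, and that removing the URL leaves a double space behind, which is harmless once the text is split on whitespace.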

In [None]:
#generate unigram from comments body
comms_bod_unigram = generate_ngrams(comms_bod_na_cleaned, 1, 20)
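
`generate_ngrams` is defined earlier in the notebook. For readers landing on this section directly, the sketch below shows the general idea of n-gram counting with a sliding window. `count_ngrams` is a hypothetical stand-in, not the notebook's helper: it does no stop-word filtering, which the real function may do, and it only assumes the same `word` and `wordcount` output columns that the bar plots in this section use.

```python
from collections import Counter
import pandas as pd

def count_ngrams(texts, n, top_k):
    """Count whitespace-token n-grams across texts and return the top_k as a dataframe."""
    counter = Counter()
    for text in texts:
        tokens = text.split()
        # Slide a window of length n over the tokens
        for i in range(len(tokens) - n + 1):
            counter[" ".join(tokens[i:i + n])] += 1
    return pd.DataFrame(counter.most_common(top_k), columns=["word", "wordcount"])

# Made-up comments for illustration
toy_comments = ["thank you so much", "thank you for the advice", "you are welcome"]
toy_bigrams = count_ngrams(toy_comments, 2, 3)
print(toy_bigrams)
```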

In [None]:
#generate barplot for unigram from comments body
plt.figure(figsize=(12,8))
#pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x=comms_bod_unigram["wordcount"], y=comms_bod_unigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Unigrams", fontsize=15)
plt.title("Top 20 unigrams from Comments Body")
plt.show()

In [None]:
#generate bigram from comments body
comms_bod_bigram = generate_ngrams(comms_bod_na_cleaned, 2, 20)

In [None]:
#generate barplot for bigram
plt.figure(figsize=(12,8))
#pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x=comms_bod_bigram["wordcount"], y=comms_bod_bigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Bigrams", fontsize=15)
plt.title("Top 20 Bigrams from Comments Body")
plt.show()

In [None]:
#generate trigram from comments body
comms_bod_trigram = generate_ngrams(comms_bod_na_cleaned, 3, 20)

In [None]:
#generate barplot for trigram
plt.figure(figsize=(12,8))
sns.barplot(x=comms_bod_trigram["wordcount"], y=comms_bod_trigram["word"])
plt.xlabel("Word Count", fontsize=15)
plt.ylabel("Trigrams", fontsize=15)
plt.title("Top 20 Trigrams from Comments Body")
plt.show()

###  **Box plots for Word count distribution and Stop words of comments_body from comments dataset**

In [None]:
#convert the comms_bod_na_cleaned list to a dataframe with a single named column
comms_bod_na_clean = pd.DataFrame({"comments_body": comms_bod_na_cleaned})

# Number of words in the comments_body
comms_bod_na_clean["word_count"] = comms_bod_na_clean["comments_body"].apply(lambda x: len(str(x).split()))
fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="word_count", data= comms_bod_na_clean, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Comment Body text', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Word Count distribution for Comments Body', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()

In [None]:
# Number of stopwords in comments_body
comms_bod_na_clean["stop_words_count"] = comms_bod_na_clean["comments_body"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
fig, ax = plt.subplots(figsize=(15,2))
sns.boxplot(x="stop_words_count", data=comms_bod_na_clean, ax=ax, palette=sns.color_palette("RdYlGn_r", 10), orient='h')
ax.set_xlabel('Stop Word Count', size=10, color="#0D47A1")
ax.set_ylabel('Comments Body text', size=10, color="#0D47A1")
ax.set_title('[Horizontal Box Plot] Number of Stop Words distribution for Comments Body', size=12, color="#0D47A1")
plt.gca().xaxis.grid(True)
plt.show()
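
The box plots show the right skew visually. As a rough numeric cross-check, the sketch below computes the word counts for a handful of made-up comments and reports the sample skewness with `Series.skew()`; a positive value, together with a mean above the median, is consistent with a right-skewed distribution. The toy list is hypothetical; on the real data the input would be `comms_bod_na_cleaned`.

```python
import pandas as pd

# Hypothetical cleaned comments; the real input would be comms_bod_na_cleaned
toy_cleaned = [
    "thanks",
    "thank you",
    "thank you so much",
    "great advice thanks",
    "this was really helpful and i will follow every one of your suggestions thank you again",
]

# Word count per comment, computed the same way as in the box plot cell
word_counts = pd.Series(toy_cleaned).apply(lambda x: len(str(x).split()))
skewness = word_counts.skew()
print(word_counts.describe())
print("Skewness:", skewness)
```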

## *Tagged User Dataset*

<div id="Tagu"> 

</div>

[Go back to start of kernel](#start)

**Tagged User Dataset summary**

It is noted based on the below that there are 136,663 entries in the dataset with no missing values. It is also seen that the most popular tag_users_tag_id has 3,135 instances, compared to 1 mention for the least occurring one. Additionally, the most popular tag_users_user_id has 82 mentions and the least popular one has 1 mention.

In [None]:
#Import and view information of the Tagged User Dataset
tags = pd.read_csv('../input/tag_users.csv')
tags.info()
#Check for missing values
tags.isnull().sum()

In [None]:
#Identify the most popular tag_users_tag_id
tag_user = tags["tag_users_tag_id"].dropna()
print(tag_user.value_counts().head(5))

In [None]:
#Identify the most popular tag_users_user_id
tag_user_id = tags["tag_users_user_id"].dropna()
print(tag_user_id.value_counts().head(5))

## *End of Kernel*

<div id="End"> 

</div>

[Go back to start of kernel](#start)

Thanks a lot for going through this attempt at conducting an EDA for the CareerVillage.org datasets.

Any comments, recommendations, upvotes and insights would be much appreciated!