## About the Dataset:

* The dataset contains 60,000 questions.
* The questions were posted between 2016 and 2020.
* There are 3 categories involved:
    1. HQ: High-quality posts with 30+ score and without a single edit.
    2. LQ_EDIT: Low-quality posts with a negative score and with multiple community edits. However, they still remain open after the edits.
    3. LQ_CLOSE: Low-quality posts that were closed by the community without a single edit.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Import Libraries

In [None]:
def showver(col):
    try:
        print("{} version: {}". format(col.__name__, col.__version__))
    except AttributeError:
        try:
            print("{} version: {}". format(col.__name__, col.version))
        except AttributeError:
            pass
        
import sys #access to system parameters https://docs.python.org/3/library/sys.html
showver(sys)    

import numpy as np # linear algebra
showver(np)

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
showver(pd)

import missingno as miss
showver(miss)

import matplotlib.pyplot as plt
showver(plt)

import squarify
showver(squarify)

import random
showver(random)

import datetime
showver(datetime)

import re
showver(re)

from collections import Counter
showver(Counter)

from nltk.corpus import stopwords #removes and, in, the, a ... etc
showver(stopwords)

import plotly.express as px
showver(px)

from bs4 import BeautifulSoup
showver(BeautifulSoup)

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

print('-' * 43)


## Data Basics

In [None]:
FILEPATH = '/kaggle/input/60k-stack-overflow-questions-with-quality-rate/data.csv'

In [None]:
df = pd.read_csv(FILEPATH)
df.sample(10)

In [None]:
df.describe(include = "all")

In [None]:
df.info()

In [None]:
df.isnull().sum()

Everything is hunky-dory! So, no worries about NA/null data!

### Visual on Null

In [None]:
miss.bar(df)
plt.show()

## 1. Let's check whether there are any duplicates.

In [None]:
df.duplicated(subset=None, keep='first')

In [None]:
len(df[df.duplicated()])

**Observation:**

* We can see that there are no duplicates in this dataset, which is a good sign!

### Tech Keys

The `Tags` column stores each tag wrapped in angle brackets, which makes it hard to see which tech keys are used most. So, we have to clean it up to get the tech keys.
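A minimal sketch of the cleanup described above, assuming each `Tags` value looks like `<python><pandas>`. The sample strings and the helper name `split_tags` are illustrative and not taken from the dataset:

```python
def split_tags(raw):
    """Turn a string such as '<python><pandas>' into ['python', 'pandas']."""
    if not raw:
        return []
    # Drop the outer brackets, then split on the '><' between adjacent tags.
    return raw.strip('<>').split('><')


print(split_tags('<python><pandas>'))   # ['python', 'pandas']
print(','.join(split_tags('<c++>')))    # c++
```

Returning a list rather than a comma-joined string keeps the per-question tag count one `len()` call away.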


In [None]:
def get_tech_keys(tag):
    
    if(not tag):
        return tag
    
    tag = tag.replace('><', ',')
    
    tag = tag.replace('<', '')
    
    tag = tag.replace('>', '')
    
    return tag

In [None]:
df['TechKeys'] = df['Tags'].apply(get_tech_keys)

In [None]:
df.head()

In [None]:
tech_keys = df['TechKeys'].tolist()

In [None]:
tech_keys

In [None]:
tech_key_list = []
tech_key_index_list = []
index_counter = 0

# Flatten the comma-separated tags into one row per tag.
for item in tech_keys:
    for item_ in item.split(','):
        # 'id' is a running row number for the flattened table, not the question Id.
        tech_key_index_list.append(index_counter)
        tech_key_list.append(item_)
        index_counter += 1

df_tech_key_new = pd.DataFrame({'id' : tech_key_index_list, 'tech_key' : tech_key_list}) 

In [None]:
df_tech_key_new.head()

In [None]:
len(df_tech_key_new)

In [None]:
df_tech_key_new.tech_key.value_counts().nlargest(10)

#### Tags Count

Let's add a column with the number of tags per question, as this might help us identify the quality of the question (I am guessing).
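One way to check that guess is to compare the average tag count per quality label. The sketch below uses plain Python and a handful of made-up rows standing in for the `Y` and `TechKeys` columns; the helper name `mean_tag_count` is hypothetical:

```python
from collections import defaultdict

# Hypothetical (label, comma-separated tags) pairs, not real dataset rows.
rows = [
    ('HQ', 'python,pandas,numpy'),
    ('HQ', 'java'),
    ('LQ_CLOSE', 'c'),
    ('LQ_EDIT', 'javascript,html'),
]


def mean_tag_count(rows):
    """Average number of tags per quality label."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for label, tags in rows:
        totals[label] += len(tags.split(',')) if tags else 0
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


print(mean_tag_count(rows))
```

If the averages turn out to be nearly identical across labels on the real data, the tag count is probably a weak signal on its own.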



In [None]:
def get_tags_counts(col):
    
    if(not col):
        return 0
    
    tags_count = len(col.split(','))
    
    return tags_count

In [None]:
df['TagsCount'] = df['TechKeys'].apply(get_tags_counts)

In [None]:
df.head()

In [None]:
df_sub = df[['Id', 'Title', 'Tags', 'TagsCount']][0:25]

In [None]:
df_sub.head()

In [None]:
def highlight_max_custom(s, color = 'lightgreen'):
    '''
    Highlight the maximum in a Series with the given color (light green by default).
    '''
    is_max = s == s.max()
    return ['background-color: '+color if v else '' for v in is_max]

In [None]:
df_sub.style.apply(highlight_max_custom, color = '#CFFE96',  axis = 0, subset=['TagsCount'])

In [None]:
df.Y.unique()

### Question Level

Let's come up with a new column called level based on the question quality.

* L1 - High Quality 
* L2 - Low Quality but open
* L3 - Low Quality Closed
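The mapping above can also be written as a dictionary lookup. This is a minimal sketch; the names `LEVEL_BY_LABEL` and `to_level` are illustrative:

```python
# Mapping from the Y labels to the levels listed above.
LEVEL_BY_LABEL = {'HQ': 1, 'LQ_EDIT': 2, 'LQ_CLOSE': 3}


def to_level(label):
    """Return the numeric level, or the input unchanged for unknown labels."""
    return LEVEL_BY_LABEL.get(label, label)


print([to_level(y) for y in ['HQ', 'LQ_EDIT', 'LQ_CLOSE', 'OTHER']])
```

With pandas, the same dictionary could be passed to `df['Y'].map(...)`, with the difference that labels missing from the dictionary become `NaN` there instead of being passed through.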


In [None]:
def get_question_level(level):
    
    if(not level):
        return level
    
    if(level == 'LQ_CLOSE'):
        return 3
    
    if(level == 'LQ_EDIT'):
        return 2
    
    if(level == 'HQ'):
        return 1
    
    return level

In [None]:
df['Level'] = df['Y'].apply(get_question_level)

In [None]:
df.head()

In [None]:
# import matplotlib.pyplot as plt

def show_donut_plot(col):
    
    rating_data = df.groupby(col)[['Id']].count().head(10)
    plt.figure(figsize = (12, 8))
    # pass a 1-D Series: plt.pie() rejects 2-D input in recent matplotlib versions
    plt.pie(rating_data['Id'], autopct = '%1.0f%%', startangle = 140, pctdistance = 1.1, shadow = True)

    # create a center circle for more aesthetics to make it better
    gap = plt.Circle((0, 0), 0.5, fc = 'white')
    fig = plt.gcf()
    fig.gca().add_artist(gap)
    
    plt.axis('equal')
    
    cols = []
    for index, row in rating_data.iterrows():
        cols.append(index)
    plt.legend(cols)
    
    plt.title('Donut Plot: SOF Questions by ' +str(col), loc='center')
    
    plt.show()

In [None]:
show_donut_plot('Level')

**Observation:**

* As the questions are equally divided, we don't have much to say here.

In [None]:
show_donut_plot('TagsCount')

In [None]:
# import squarify

def show_treemap(col, max_labels = 10):
    df_type_series = df.groupby(col)['Id'].count().sort_values(ascending = False).head(20)

    type_sizes = []
    type_labels = []
    for i, v in df_type_series.items():
        type_sizes.append(v)
        
        type_labels.append(str(i) + ' ('+str(v)+')')


    fig, ax = plt.subplots(1, figsize = (12,12))
    squarify.plot(sizes=type_sizes, 
                  label=type_labels[:max_labels],  # show labels for only the first max_labels items
                  alpha=.2 )
    
    plt.title('TreeMap: SOF Questions by '+ str(col))
    plt.axis('off')
    plt.show()

In [None]:
show_treemap('Level')

In [None]:
# print random body (random column data)
# import random

# random.randint() includes both endpoints, so the upper bound is len(df) - 1
# df.at[df.index[random.randint(0, len(df) - 1)], 'Body']

In [None]:
def code_available(content):
    
    if('<code>' in content):
        return True
    
    return False

In [None]:
df['code_available'] = df['Body'].apply(code_available)

In [None]:
df.head()

In [None]:
show_donut_plot('code_available')

**Observation:**

* Almost half of the questions have no code in the body.

### Year, Month and Week Columns

Let's create more columns from `CreationDate`. We first parse it into `CreationDatetime`, and then derive `CreationMonth`, `CreationYear`, and `CreationWeek`.
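As a small standalone illustration of the derived fields, here is a sketch using only the standard library. The timestamp format `'%Y-%m-%d %H:%M:%S'` and the helper name `date_parts` are assumptions, not confirmed from the dataset:

```python
from datetime import datetime


def date_parts(stamp, fmt='%Y-%m-%d %H:%M:%S'):
    """Return (year, month, ISO week) for a timestamp string."""
    moment = datetime.strptime(stamp, fmt)
    _, iso_week, _ = moment.isocalendar()
    return moment.year, moment.month, iso_week


print(date_parts('2016-01-01 00:21:59'))  # (2016, 1, 53)
print(date_parts('2016-01-04 10:00:00'))  # (2016, 1, 1)
```

Note that ISO weeks do not always line up with the calendar year: 1 January 2016 belongs to ISO week 53 of 2015, so a week-based plot can show early-January questions under week 52 or 53.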

In [None]:
# import datetime

def get_week(col):
    
    return col.strftime("%V")

In [None]:
# Create new columns for the creation month, year and ISO week
df['CreationDatetime'] = pd.to_datetime(df['CreationDate']) 
df['CreationMonth'] = df['CreationDatetime'].dt.month.astype(int)
df['CreationYear'] = df['CreationDatetime'].dt.year.astype(int)
df['CreationWeek'] = df['CreationDatetime'].apply(get_week).astype(int)

In [None]:
# df.info()

In [None]:
show_donut_plot('CreationMonth')

In [None]:
show_treemap('CreationMonth')

In [None]:
show_donut_plot('CreationYear')

In [None]:
show_donut_plot('CreationWeek')

In [None]:
show_treemap('CreationYear')

In [None]:
show_treemap('CreationWeek', 18)

In [None]:
df_tech_key_new

In [None]:
def show_donut_plot_techkey(col):
    
    rating_data = df_tech_key_new.groupby(col)[['id']].count().head(50)
    plt.figure(figsize = (12, 8))
    # pass a 1-D Series: plt.pie() rejects 2-D input in recent matplotlib versions
    plt.pie(rating_data['id'], autopct = '%1.0f%%', startangle = 140, pctdistance = 1.1, shadow = True)

    # create a center circle for more aesthetics to make it better
    gap = plt.Circle((0, 0), 0.5, fc = 'white')
    fig = plt.gcf()
    fig.gca().add_artist(gap)
    
    plt.axis('equal')
    
    cols = []
    for index, row in rating_data.iterrows():
        cols.append(index)
    plt.legend(cols)
    
    plt.title('Donut Plot by ' +str(col), loc='center')
    
    plt.show()

In [None]:
show_donut_plot_techkey('tech_key')

Warning:

Something in this donut plot is off, and I can't figure it out at the moment. However, I will come back and fix it soon.
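One possible explanation, offered as a guess rather than a confirmed diagnosis: a pandas `groupby` sorts its groups by key, so taking the head of the grouped counts keeps the alphabetically first tags rather than the most frequent ones. The sketch below shows the difference with the standard library only; the tag list is made up:

```python
from collections import Counter

# Hypothetical tag list standing in for df_tech_key_new['tech_key'].
tags = ['python', 'java', 'python', 'android', 'c#', 'python', 'java']
counts = Counter(tags)

# Alphabetical order, which is what the head of a key-sorted grouping gives.
first_by_name = sorted(counts.items())[:2]

# Most frequent tags, which is usually what a "top tags" chart wants.
top_by_count = counts.most_common(2)

print(first_by_name)  # [('android', 1), ('c#', 1)]
print(top_by_count)   # [('python', 3), ('java', 2)]
```

In pandas, `value_counts()` already returns counts in descending order, so its first rows would be the most frequent tags.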

In [None]:
def show_treemap_tech_key(col):
    df_type_series = df_tech_key_new.groupby(col)['id'].count().sort_values(ascending = False).head(50)

    type_sizes = []
    type_labels = []
    for i, v in df_type_series.items():
        type_sizes.append(v)
        
        type_labels.append(str(i) + ' ('+str(v)+')')


    fig, ax = plt.subplots(1, figsize = (12,12))
    squarify.plot(sizes=type_sizes, 
                  label=type_labels[:25],  # show labels for only the first 25 items
                  alpha=.2 )
    plt.title('TreeMap by '+ str(col))
    plt.axis('off')
    plt.show()

In [None]:
show_treemap_tech_key('tech_key')

**Observation:**

* We can clearly see that JavaScript and Python dominate the questions, followed by Java.
* We know that Java has been lagging behind, unable to compete with Python since 2013.
* I am surprised to see Android in 4th place.
* Where is Node.js? Hello?

In [None]:
def show_donut_plot_2cols(col1, col1_val, col2):
    
    df1 = df[df[col1] == col1_val]
    
    rating_data = df1.groupby(col2)[['Id']].count().head(10)
    plt.figure(figsize = (12, 8))
    # pass a 1-D Series: plt.pie() rejects 2-D input in recent matplotlib versions
    plt.pie(rating_data['Id'], autopct = '%1.0f%%', startangle = 140, pctdistance = 1.1, shadow = True)

    # create a center circle for more aesthetics to make it better
    gap = plt.Circle((0, 0), 0.5, fc = 'white')
    fig = plt.gcf()
    fig.gca().add_artist(gap)
    
    plt.axis('equal')
    
    cols = []
    for index, row in rating_data.iterrows():
        cols.append(index)
    plt.legend(cols)
    
    plt.title('Donut Plot by ' +str(col1) + ' and ' +str(col2), loc='center')
    
    plt.show()

In [None]:
show_donut_plot_2cols('CreationYear', 2016, 'Level')

In [None]:
show_donut_plot_2cols('CreationYear', 2016, 'code_available')

In [None]:
df.head()

In [None]:
df.Y.unique()

## Word Cleanup

As the body contains both code and prose, we have to separate the code from the rest of the content. The code below does that.
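As an alternative illustration of the same idea, the split can be sketched with a regular expression. This is not the notebook's implementation; the names `CODE_BLOCK` and `split_code_and_text` and the sample string are made up:

```python
import re

# Non-greedy match of one <code>...</code> block; DOTALL lets it span newlines.
CODE_BLOCK = re.compile(r'<code>.*?</code>', flags=re.DOTALL)


def split_code_and_text(body):
    """Return (list of code snippets, body with the snippets replaced by spaces)."""
    snippets = CODE_BLOCK.findall(body)
    text = CODE_BLOCK.sub(' ', body)
    return snippets, text


sample = '<p>Why does <code>x = 1</code> fail?</p><pre><code>print(x)\n</code></pre>'
snippets, text = split_code_and_text(sample)
print(snippets)
print(text)
```

Unlike the loop-based functions in this notebook, this pattern ignores a `<code>` tag that is never closed, so the two approaches can differ on malformed bodies.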

Also, we will do a little cleaning on the content by removing stop words and words shorter than 3 characters.
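The filtering step can be sketched in a few lines. The stop-word set here is a tiny hand-written stand-in for the NLTK English list, and `keep_words` is a hypothetical helper name:

```python
from collections import Counter

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {'the', 'and', 'this', 'that', 'with', 'for'}


def keep_words(text, min_len=3):
    """Lower-case, split on whitespace, drop stop words and words shorter than min_len."""
    return [w for w in text.lower().split() if len(w) >= min_len and w not in STOP_WORDS]


words = keep_words('This is the code that I run with the code editor')
print(words)
print(Counter(words).most_common(1))
```

Keeping the stop words in a `set` makes each membership test constant time, which matters when the check runs once per word across 60,000 question bodies.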

In [None]:
import re

code_start = '<code>'
code_end   = '</code>'

def get_codes(content):
    
    # return all <code>...</code> snippets joined by spaces, or None if there are none
    if(code_start not in content):
        return None
    
    code_list = []
    
    while(code_start in content):

        code_start_index = content.index(code_start)
        # look for the closing tag only after the opening tag, so a stray
        # </code> earlier in the text cannot yield an empty slice and an endless loop
        code_end_index = content.find(code_end, code_start_index)
        if(code_end_index == -1):
            code_end_index = len(content)

        substring_1 = content[code_start_index : (code_end_index + len(code_end) )]
 
        code_list.append(substring_1)
        
        content = content.replace(substring_1, '')

    
    return ' '.join(code_list)

def clean_text(content):
    
    content = content.lower()
    
    # strip any remaining HTML tags
    content = re.sub(r'<.*?>', '', content)
    
    # remove URLs first, while the scheme (http://, https://, ...) is still intact
    content = re.sub(r"\w+:\/\/\S+", "", content)
    # remove @mentions and a leading "rt"
    content = re.sub(r"(@[A-Za-z0-9]+)|^rt", "", content)
    # replace everything except letters, digits, spaces and tabs with a space
    content = re.sub(r"[^0-9A-Za-z \t]", " ", content)

    # remove numbers
    content = re.sub(r"\d+", "", content)
    
    # collapse repeated spaces
    content = re.sub(" +", " ", content)
    
    return content

def get_non_codes(content):
    
    # drop every <code>...</code> snippet, then clean the remaining text
    while(code_start in content):

        code_start_index = content.index(code_start)
        # look for the closing tag only after the opening tag, so a stray
        # </code> earlier in the text cannot yield an empty slice and an endless loop
        code_end_index = content.find(code_end, code_start_index)
        if(code_end_index == -1):
            code_end_index = len(content)

        substring_1 = content[code_start_index : (code_end_index + len(code_end) )]

        content = content.replace(substring_1, ' ')
        
    content = clean_text(content)

    return content

In [None]:
df['Body_code'] = df['Body'].apply(get_codes)
df['Body_content'] = df['Body'].apply(get_non_codes)

In [None]:
# from collections import Counter
# from nltk.corpus import stopwords

stopwords1 = stopwords.words('english')



df['content_words'] = df['Body_content'].apply(lambda x:str(x).split())

In [None]:
def remove_short_words(content):

    new_content_list = []
    for item in content:
        
        if(len(item) > 2):
            new_content_list.append(item)
    
    return new_content_list
    

In [None]:
df['content_words'] = df['content_words'].apply(remove_short_words)

In [None]:
df.head()

In [None]:
stopword_set = set(stopwords1)  # set lookup is much faster than scanning a list
words_collection = Counter(item for sublist in df['content_words'] for item in sublist if item not in stopword_set)
freq_word_df = pd.DataFrame(words_collection.most_common(30))
freq_word_df.columns = ['frequently_used_word','count']

freq_word_df.style.background_gradient(cmap='YlGnBu', low=0, high=0, axis=0, subset=None)

In [None]:
# import plotly.express as px

fig = px.scatter(freq_word_df, x="frequently_used_word", y="count", color="count", title = 'Frequently used words - Scatter plot')
fig.show()

In [None]:
fig = px.pie(freq_word_df, values='count', names='frequently_used_word', title='Stackoverflow Questions - Frequently Used Word')
fig.show()

**Observation:**

* `code` and `like` are the most used words in the questions. Everyone is looking for code, huh?

### Question Level Sunburst Plot

In [None]:
fig = px.sunburst(df, path=['CreationYear', 'CreationMonth'], values='Level',
                  color='Level', hover_data=['Level'])
fig.show()

### Code Available Sunburst

In [None]:
fig = px.sunburst(df, path=['CreationYear', 'CreationMonth'], values='code_available',
                  color='code_available', hover_data=['code_available'])
fig.show()

### Code Available Strip Plot

In [None]:
fig = px.strip(df, x="CreationMonth", y="code_available", orientation="h", color="CreationYear")
fig.show()

To Do:

* Introduce Quality Level (L1 - High, L2 - Low but open, L3 - Poor)
* Predict the LQ_CLOSE items based on the question content (like what phrase is used)
* Treemap with categories
* Donut plot with category
* Donut plot with year and category
* Which month has the most poor-quality questions

**Final Notes:**

I am still adding things. You can come back and check for more information.

Also, if you **like my notebook**, <font style="color:blue;size:14px;">please upvote it</font> as it will motivate me to come up with better approaches in the upcoming notebooks.


<font color="blue" size=+1.5><b>Check out my other kernels</b></font>

<div style="margin-bottom: 20px;">
    &nbsp;
<div style="float:left; margin-right:10px;">
<a href="https://www.kaggle.com/kamalkhumar/amazon-review-prediction-using-spacy" class="btn btn-info" style="color:white;">Amazon review prediction using spaCy</a>
</div>
 
<div style="float:left; margin-right:10px;"> 
<a href="https://www.kaggle.com/kamalkhumar/titanic-prediction" class="btn btn-info" style="color:white;">Titanic Prediction</a>
</div>

<div style="float:left; margin-right:10px;">   
<a href="https://www.kaggle.com/kamalkhumar/loan-status-prediction" class="btn btn-info" style="color:white;">Loan Status Prediction</a>
</div>
</div>
    
<div style="float:left; margin-right:10px;">    
<a href="https://www.kaggle.com/kamalkhumar/kollywood-prediction" class="btn btn-info" style="color:white;">Kollywood Data Prediction</a><br><br>
</div>    

<div style="float:left; margin-right:10px;">    
<a href="https://www.kaggle.com/kamalkhumar/sms-spam-or-not-base" class="btn btn-info" style="color:white;">SMS Spam or Not Prediction</a><br><br>
</div>    