## Guided Project: Popular Data Science Questions
The goal of this project is to find content that is worth covering in a data science education service. We'll look through the [Data Science Stack Exchange](https://datascience.stackexchange.com/) (DSSE) website for popular subjects that interest people. The subjects we find are candidates for the content we create for our service.

## Stack Exchange

__What kind of questions are welcomed on this site?__

In the help center of the DSSE website, we can read that we should:

- Avoid asking subjective questions.
- Ask practical questions about Data Science.
- Ask specific and reasonably scoped questions.
- Make questions relevant to others.

Questions of this kind show us what people actually want to learn about, which makes them useful for our goal.

__What, other than questions, does the site's home subdivide into?__

Besides the [home](https://datascience.stackexchange.com/) page, DSSE is divided into four other sections:
- [Questions](https://datascience.stackexchange.com/questions)

- [Tags](https://datascience.stackexchange.com/tags), which group questions by sub-topic, including:
    - Machine learning
    - Python
    - Neural network
    - Deep learning
    - Classification
    - Keras
    - And many more...

- [Users](https://datascience.stackexchange.com/users)
- [Unanswered](https://datascience.stackexchange.com/unanswered)

The most useful section may be the `Tags` section, while the least useful sections are the `Unanswered` and `Users` sections.

__What information is available in each post?__

Looking at the most upvoted topics in DSSE, I can see that a `User` can post an in-depth question asking for guidance or an explanation, usually about a problem they are currently working on.

Each post includes:
- Post Title
- Post Author
- Question(s)
- Post Score
- Other users' responses
- Date of the post
- Time of the last activity on the post
- Number of times the post has been viewed
- Number of times the post has been favorited

## Stack Exchange Data Explorer

To find the data we need about DSSE, we will use the [Stack Exchange Data Explorer](https://data.stackexchange.com/datascience/query/new). The Data Explorer is a website that lets you query the DSSE database using [Transact-SQL](https://en.wikipedia.org/wiki/Transact-SQL). After exploring and experimenting with the Data Explorer, I believe the most useful table to query is the `Posts` table.

The `Posts` Table includes:
- Id
- PostTypeId 
- Score
- ViewCount
- Tags
- CreationDate
- AnswerCount
- FavoriteCount

These are the columns I believe will be the most useful in achieving the goal. `PostTypeId` lets us tell questions apart from answers. `Score`, `ViewCount`, `AnswerCount`, and `FavoriteCount` give us a gauge of popularity. `Tags` shows us which sub-topics a post covers. `CreationDate` tells us when each question was asked, so we can see which questions have been popular lately.

## Getting the Data

Running the following query in the Data Explorer provides the data we need to accomplish our goal.

`SELECT 
  Id,
  CreationDate,
  Score,
  ViewCount,
  Tags,
  AnswerCount,
  FavoriteCount
FROM posts
WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;`

We can save the results of this query from the Data Explorer as a .csv file so we can explore the data.

## Exploring the Data

In [1]:
# Read in libraries to be used
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])

In [2]:
# Explore/Observe the Dataset
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB


About 84% of the values in the `FavoriteCount` column are missing (only 1,407 of 8,839 are non-null), and it is the only column with missing data. A missing value here most likely indicates that the question is not on any user's favorites list. We can replace the missing values with zero and change the column type to `int`, since it no longer needs to be `float`.
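The share of missing values per column can be checked directly rather than read off the `info()` output. The following is a minimal sketch on a made-up five-row frame (the `toy` data is an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real data: five questions, only one has a favorite count
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "FavoriteCount": [None, 2.0, None, None, None],
})

# isna() marks missing cells; the mean of those booleans is the missing share
missing_share = toy.isna().mean()
print(missing_share)
```

Running the same `isna().mean()` call on `questions` should give the share for each real column.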

The `Tags` column has data type `object`. Let's determine what types the objects in `questions["Tags"]` are.

In [3]:
questions["Tags"].apply(lambda value: type(value)).unique()

array([<class 'str'>], dtype=object)

The values in the `Tags` column are all strings. Each post on Stack Exchange is limited to a maximum of five tags. We could split the `Tags` column into five columns (one per tag), but that doesn't seem useful here, so we will keep each value as a list of strings.

Fortunately, the Stack Exchange Data Explorer provides clean data. Other than these two columns, the remaining columns have suitable data types and no missing values.

## Cleaning the Data

In [4]:
# Fill in the `FavoriteCount` column with zeros
questions.fillna(value ={"FavoriteCount":0}, inplace=True)

# Change the datatype of the `FavoriteCount` column
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)

# Print information on the dataset
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    8839 non-null int64
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB


The `Tags` column looks like the following:

In [8]:
questions.iloc[0:3, 4]

0                      <machine-learning><data-mining>
1    <machine-learning><regression><linear-regressi...
2         <python><time-series><forecast><forecasting>
Name: Tags, dtype: object

Each `Tags` value is a single string in which `<` and `>` act as separators. We can transform each string into a proper list of tags.

In [None]:
# replace separators in the Tags columns
# Strip the leading "<" and trailing ">", then split on the "><" between tags
questions["Tags"] = questions["Tags"].str.replace("^<|>$", "", regex=True).str.split("><")
questions.sample(5)
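To see what this transformation produces, here is a small sketch on two hand-written tag strings. The `raw` series is an assumption for illustration; `regex=True` is passed explicitly so the pattern is treated as a regular expression.

```python
import pandas as pd

# Two hand-written tag strings in the same "<tag><tag>" format
raw = pd.Series(["<machine-learning><data-mining>", "<python>"])

# Remove the outer angle brackets, then split on the "><" between tags
parsed = raw.str.replace("^<|>$", "", regex=True).str.split("><")
print(parsed.tolist())
```

Each element of `parsed` is now a Python list, which is what the counting loops in the next section iterate over.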

## Most Used and Most Viewed Tags

In this section we will look at how many times each tag has been used and viewed.

In [None]:
# Create empty dictionary to keep track of count
tag_count = dict()

# Loop to create count for tag_count dict
for tags in questions["Tags"]:
    for tag in tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
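The same tally can also be sketched without an explicit loop. This is an alternative to the approach above, not a replacement for it, and the `toy` frame is made up for illustration:

```python
import pandas as pd

# Toy frame whose Tags column already holds lists of tags
toy = pd.DataFrame({"Tags": [["python", "pandas"], ["python"], ["keras", "python"]]})

# explode() gives every tag its own row; value_counts() tallies them
tag_counts = toy["Tags"].explode().value_counts()
print(tag_counts)
```

The result is a Series indexed by tag, sorted from most to least used.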

In [None]:
# Change aesthetics of tag_count
tag_count = pd.DataFrame.from_dict(tag_count, orient="index")
tag_count.rename(columns={0:"Count"}, inplace=True)
tag_count.head(7)

In [None]:
# Sort the Count by value
most_used = tag_count.sort_values(by="Count").tail(20)
most_used

In [None]:
# Visualize the tags counted
most_used.plot(kind="barh", figsize=(14,10))

We don't need to look beyond the top 20 tags: tags further down the list are used too rarely to be of much use, so these 20 should be enough to accomplish our goal.

For the next part, where we want to see how many times each tag has been viewed, we can use Python's built-in `enumerate()` function. Its utility is best understood by seeing it in action.

In [None]:
some_iteration = "Iterate"

for i, c in enumerate(some_iteration):
    print(i, c)

The `enumerate()` function yields pairs made of the `index` and the `element` at that position, which the loop above prints one by one.

In [None]:
# Dictionary for tag view count
tag_view_count = dict()

# Loop through Tags and count the views
for idx, tags in enumerate(questions["Tags"]):
    for tag in tags:
        if tag in tag_view_count:
            tag_view_count[tag] += questions["ViewCount"].iloc[idx]
        else:
            tag_view_count[tag] = questions["ViewCount"].iloc[idx]

# Modify the aesthetic
tag_view_count = pd.DataFrame.from_dict(tag_view_count, orient="index")
tag_view_count.rename(columns={0:"ViewCount"}, inplace=True)

# Sort the data
most_viewed = tag_view_count.sort_values(by="ViewCount").tail(20)

# Create plot of the data
most_viewed.plot(kind="barh", figsize=(14, 10))
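As a cross-check on the loop, the per-tag view totals can be sketched with `explode()` and `groupby()`. The idea is that each question's views are credited to every tag on that question. The `toy` frame and its numbers are made up for illustration:

```python
import pandas as pd

# Toy questions: each has a list of tags and a view count
toy = pd.DataFrame({
    "Tags": [["python", "pandas"], ["python"], ["keras"]],
    "ViewCount": [10, 5, 7],
})

# One row per (question, tag) pair, then sum the views for each tag
views_per_tag = toy.explode("Tags").groupby("Tags")["ViewCount"].sum()
print(views_per_tag)
```

Here `python` ends up with 15 views because it appears on the questions with 10 and 5 views.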

Let's view them side by side.

In [None]:
fig, axs = plt.subplots(nrows = 1, ncols=2)
fig.set_size_inches((20,10))
most_used.plot(kind="barh", ax=axs[0], subplots=True)
most_viewed.plot(kind="barh", ax=axs[1], subplots=True)

In [None]:
# Merging the two data of tags together
in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
in_viewed = pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)

In [None]:
print(in_used)

In [None]:
print(in_viewed)
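A tag that is in one top-20 list but not the other shows up as `NaN` in these merges. One way to see both cases at once is an outer merge with `indicator=True`, sketched below on two made-up frames (`used` and `viewed` are hypothetical stand-ins for `most_used` and `most_viewed`):

```python
import pandas as pd

# Hypothetical stand-ins for most_used and most_viewed
used = pd.DataFrame({"Count": [5, 3]}, index=["python", "keras"])
viewed = pd.DataFrame({"ViewCount": [100, 80]}, index=["python", "pandas"])

# An outer merge keeps every tag; the _merge column records where each came from
both = pd.merge(used, viewed, how="outer",
                left_index=True, right_index=True, indicator=True)
print(both)
```

Rows marked `left_only` are among the most used but not the most viewed, and rows marked `right_only` are the reverse.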

## Relations Between Tags

One way to gauge how pairs of tags are related to each other is to count how many times each tag appears together with another.

We can first create a list of all the tags.

In [None]:
# List of all tags
all_tags = list(tag_count.index)

In [None]:
# List of all the tags
print(all_tags)

We can now create a dataframe where each row will represent a tag, and each column will also represent a tag.

In [None]:
# Dataframe with index and columns are tags
tag_associations = pd.DataFrame(index=all_tags, columns=all_tags)

# Print out what the Dataframe looks like
tag_associations.iloc[0:4,0:4]

First we fill this dataframe with zeroes. Then, for each list of tags in `questions["Tags"]`, we increment by one every cell whose row and column are both tags in that list. The result is a dataframe that tells us, for each pair of tags, how many times they were used together.

In [None]:
# Fill with zeroes and make the counts integers rather than generic objects
tag_associations = tag_associations.fillna(0).astype(int)

for tags in questions["Tags"]:
    tag_associations.loc[tags, tags] += 1
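The effect of the `.loc[tags, tags] += 1` update is easier to follow on a tiny example. The sketch below uses three made-up questions and three tags (all names and counts here are assumptions for illustration):

```python
import pandas as pd

# Three made-up questions and the full list of tags they use
tag_lists = [["python", "pandas"], ["python", "keras"], ["python"]]
all_tags = ["python", "pandas", "keras"]

# Start from an all-zero square table with tags as both rows and columns
co = pd.DataFrame(0, index=all_tags, columns=all_tags)

# For each question, bump every cell whose row and column are both in its tags
for tags in tag_lists:
    co.loc[tags, tags] += 1

print(co)
```

The diagonal holds how often each tag was used on its own count (`python` appears 3 times), while the off-diagonal cells hold how often two tags appeared on the same question.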

Let's focus our attention on the most used tags. We'll add some colors to make it easier to talk about the dataframe.

In [None]:
relations_most_used = tag_associations.loc[most_used.index, most_used.index]

def style_cells(x):
    helper_df = pd.DataFrame('', index=x.index, columns=x.columns)
    helper_df.loc["time-series", "r"] = "background-color: yellow"
    helper_df.loc["r", "time-series"] = "background-color: yellow"
    for k in range(helper_df.shape[0]):
        helper_df.iloc[k,k] = "color: blue"
    
    return helper_df

relations_most_used.style.apply(style_cells, axis=None)

The cells highlighted in yellow tell us that `time-series` was used together with `r` 22 times. The values in blue tell us how many times each of the tags was used. We saw earlier that `machine-learning` was used `2693` times and we confirm it in this dataframe.

It is difficult to understand what is going on in this dataframe. We can create a heatmap to make it easier to read. Before we do, let's get rid of the values in blue; otherwise they will dominate the color scale and the visualization will misrepresent the data.

In [None]:
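import numpy as np

# Cast to float so the diagonal can hold NaN, then blank out the blue values
relations_most_used = relations_most_used.astype(float)
for i in range(relations_most_used.shape[0]):
    relations_most_used.iloc[i, i] = np.nan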

In [None]:
# Create a heatmap for the tags
plt.figure(figsize=(12,8))
sns.heatmap(relations_most_used, cmap="PuBu", annot=False)

The most used tags also seem to have the strongest relationships, as given by the dark concentration in the bottom right corner. However, this could simply be because each of these tags is used a lot, and so end up being used together a lot without possibly even having any strong relation between each other.

A more intuitive manifestation of this phenomenon is the following. A lot of people buy bread, a lot of people buy toilet paper, so they end up being purchased together a lot, but purchasing one of them doesn't increase the chances of purchasing the other.
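One rough way to put a number on this is to compare the observed co-occurrence count with the count we would expect if the two items were independent. This ratio is often called lift; the source analysis does not compute it, and the function name and all counts below are hypothetical.

```python
def lift(n_both, n_a, n_b, n_total):
    """Observed co-occurrences divided by the count expected under independence."""
    expected = n_a * n_b / n_total
    return n_both / expected

# Hypothetical counts out of 1000 baskets
# Two very common items that are bought together only as often as chance predicts
common_pair = lift(n_both=360, n_a=600, n_b=600, n_total=1000)

# Two rare items that are bought together far more often than chance predicts
rare_pair = lift(n_both=40, n_a=50, n_b=50, n_total=1000)

print(common_pair, rare_pair)
```

A value near 1 suggests the pair co-occurs about as often as chance alone would produce, even when the raw count is large; a value well above 1 suggests a real association, even when the raw count is small.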

Another shortcoming of this attempt is that it only looks at relations between pairs of tags and not between larger groups of tags. For example, it could be the case that when used together, `dataset` and `scikit-learn` have a "strong" relation to `pandas`, but each by itself doesn't.

## Enter Domain Knowledge

Doing some research and digging deeper, we find that `keras`, `tensorflow`, and `scikit-learn` are all `python` libraries. `keras` and `tensorflow` are used for `deep-learning` (a branch of `machine-learning` built on `neural-network` models), while `scikit-learn` covers more general machine learning.

If we wanted to create a course around these top tags, we could recommend a course in Python with a focus on machine learning that extends into deep learning for the usage of classification.

## Is It Just a Fad?