# EDA on Medium blogs

I have been blogging for some time on [dev.to](https://dev.to/kedark) and [Medium](https://medium.com/@kedarkodgire.kk), and I thought it would be great to analyze some blog data to answer questions such as: what should the reading time be, on which day should a blog be posted, and which image extensions are used in blogs? I found an interesting dataset on Kaggle for Medium blogs. It contains information about randomly chosen Medium articles published in 2019 from these 7 publications:

* Towards Data Science
* UX Collective
* The Startup
* The Writing Cooperative
* Data Driven Investor
* Better Humans
* Better Marketing

I will be using Python libraries such as pandas, numpy, matplotlib and seaborn for this analysis.

In [None]:
project_name = "EDA on Medium blogs" 

In [None]:
pip install --upgrade pip

In [None]:
!pip install jovian --upgrade -q

In [None]:
import jovian

In [None]:
jovian.commit(project=project_name)

## Data Preparation and Cleaning

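In this section we load the dataset, keep only the columns we need, convert the date column to a proper datetime type, and clean up the image column.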

In [None]:
import pandas as pd
import numpy as np

In [None]:
medium_df=pd.read_csv('../input/medium-articles-dataset/medium_data.csv')

In [None]:
medium_df

The output tells us that we have information for 6508 blog posts on Medium, with columns such as responses, reading_time and publication. Let's look at some more information about this dataset using the `.info()` method.

In [None]:
medium_df.info()

Now that we have the basic info, we can say that the columns ID and URL are not needed for the analysis. Let's drop them by selecting only the required columns, and make a copy (using the `.copy()` method) while doing so, because it is not recommended to modify the original dataset.

In [None]:
medium_df.columns

In [None]:
required_columns = ['title','subtitle', 'image', 'claps', 'responses',
       'reading_time', 'publication', 'date']

In [None]:
medium_df_eda = medium_df[required_columns].copy()
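As a quick illustration of why `.copy()` matters, here is a minimal sketch on a made-up dataframe (the names `original_df` and `subset_df` and all values are assumptions for illustration, not part of the dataset). Changing the copy leaves the original untouched:

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up)
original_df = pd.DataFrame({
    'id': [1, 2],
    'title': ['first post', 'second post'],
    'claps': [10, 20],
})

# Select the columns we need and take an independent copy
subset_df = original_df[['title', 'claps']].copy()

# Modify the copy only
subset_df.loc[0, 'claps'] = 999

print(original_df.loc[0, 'claps'])  # still 10
print(subset_df.loc[0, 'claps'])    # 999
```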

Now that we have made a copy, let's check that it was copied properly.

In [None]:
medium_df.info()
medium_df_eda.info()

Now it's confirmed that we have a new dataset to explore and play with. We can see that the date column has datatype object, so let's convert it into a datetime so that it can be used in the analysis.

In [None]:
medium_df_eda.date = pd.to_datetime(medium_df_eda.date)
medium_df_eda.info()


In [None]:
print(medium_df_eda.date[0])
print(medium_df_eda.date[0].day)
print(medium_df_eda.date[0].month)
print(medium_df_eda.date[0].year)

Wonderful, now that we have converted the datatype we can access the day, month and year. Let's modify the image column next, as we only need the image type and not the file name for the analysis.

In [None]:
medium_df_eda.image = medium_df_eda.image.str.replace('[0-9.]','',regex=True)
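To see what this regular expression does in isolation, here is a small sketch on made-up file names. It assumes the names look like `1.png`, i.e. digits followed by a dot and the extension; the `rsplit` variant is an alternative I am suggesting, not something used in this notebook.

```python
import pandas as pd

# Made-up image file names in a "<number>.<extension>" style
sample_images = pd.Series(['1.png', '23.jpeg', '4.JPG', '567.gif'])

# Remove every digit and every dot, leaving only the extension letters
extensions = sample_images.str.replace('[0-9.]', '', regex=True)
print(extensions.tolist())  # ['png', 'jpeg', 'JPG', 'gif']

# Alternative: keep whatever follows the last dot
# (this also works when a file name contains letters)
by_split = pd.Series(['cover_1.png', '23.jpeg']).str.rsplit('.', n=1).str[-1]
print(by_split.tolist())  # ['png', 'jpeg']
```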

`replace('[0-9.]','',regex=True)` combines a regular expression with the replace function. The pattern `[0-9.]` matches any single digit from 0 to 9 as well as the `.` character, and every match is replaced with an empty string, so only the extension is left. Let's check the number of unique image extensions using the `value_counts` function.

In [None]:
medium_df_eda.image.value_counts()

What? We have 2 rows with no image. Let's remove these two rows to make the dataset cleaner.

In [None]:
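# Assign the result back instead of calling replace(..., inplace=True) on a column,
# which does not reliably modify the dataframe in recent pandas versions
medium_df_eda['image'] = medium_df_eda['image'].replace('', np.nan)
medium_df_eda.dropna(subset=['image'], inplace=True)

The `replace` call turns the empty strings in the image column into NaN, and `dropna(subset=['image'])` then drops the rows (not the columns) that have NaN in that column.
Let's check if they are removed now.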

In [None]:
medium_df_eda.image.value_counts()
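As an aside, the replace-then-drop pattern used above can be tried on a tiny made-up dataframe. This is only a sketch; `toy_df` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'image': ['png', '', 'jpeg'],
})

# Step 1: turn empty strings into NaN so pandas treats them as missing
toy_df['image'] = toy_df['image'].replace('', np.nan)

# Step 2: drop the rows (not columns) where 'image' is missing
toy_df = toy_df.dropna(subset=['image'])

print(toy_df['title'].tolist())  # ['a', 'c']
```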

And they are removed :)
But we also have some extensions in upper case, so let's turn them into lower case.

In [None]:
medium_df_eda.image = medium_df_eda.image.str.lower()

In [None]:
medium_df_eda.image.value_counts()

Now we are good, so let's proceed to visualization.

In [None]:
import jovian

In [None]:
jovian.commit(project=project_name)

## Exploratory Analysis and Visualization

Visualizations are interesting, aren't they? Let's do some visualizations now.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Publications 
Now that we have imported the required libraries, let's get started with the visualizations. We can check the number of articles that each publication published using the `value_counts()` method in pandas.

In [None]:
publication_articles_count = medium_df_eda.publication.value_counts().rename_axis('publications').reset_index(name='counts')
publication_articles_count
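The chain used in the cell above can be looked at step by step on a toy Series. This is a minimal sketch; the labels `A`, `B`, `C` and the intermediate variable names are made up for illustration.

```python
import pandas as pd

toy_publications = pd.Series(['A', 'B', 'A', 'A', 'B', 'C'])

# value_counts() returns a Series: index = label, value = number of occurrences
counts = toy_publications.value_counts()

# rename_axis() gives the index a name, which becomes the column name later
named = counts.rename_axis('publications')

# reset_index(name=...) moves the index into a column and names the values column
counts_df = named.reset_index(name='counts')

print(counts_df)
```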

In the code above, `rename_axis('publications').reset_index(name='counts')` converts the `value_counts()` output into a dataframe so that we can plot graphs with it. Let's proceed by plotting a bar graph of the data.

In [None]:
plt.figure(figsize=(12,6))
plt.xlabel("number of articles")
plt.title("Number of articles by publications")
sns.barplot(y=publication_articles_count.publications,x=publication_articles_count.counts);

From the above graph we can see that

#### *The Startup* has the most articles, around 3000, and *Better Humans* has the fewest articles published on Medium, around 10.

We can also infer that most people read articles from The Startup, Towards Data Science and Data Driven Investor, assuming these publications post a large number of articles in response to demand from their audience.

(Note: we cannot be completely sure about this inference because our dataset might not be the complete data.)

Now let's plot a graph that tells us on which days the maximum and minimum number of articles are published. For this we can plot month vs day_of_week for the number of articles published. Let's create a new dataframe with all the dates and their counts.

In [None]:
articles_df = medium_df_eda.date.value_counts().rename_axis('dates').reset_index(name='counts')

In [None]:
articles_df.info()

Cool, as you can see we have the required columns from the main dataframe, i.e. dates and counts. Let's extract the day of the week and the month using `dt` and make two new columns for them, as we are going to use these for our graph.

In [None]:
articles_df['month'] = articles_df.dates.dt.month_name()
articles_df['day_of_week'] = articles_df.dates.dt.day_name()
articles_df
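For a feel of what the `dt` accessor returns, here is a small sketch on two hand-picked dates from 2019 (the dataframe `toy_dates` is made up for illustration):

```python
import pandas as pd

toy_dates = pd.DataFrame({'dates': pd.to_datetime(['2019-10-07', '2019-01-26'])})

# The .dt accessor exposes datetime helpers on a whole column at once
toy_dates['month'] = toy_dates['dates'].dt.month_name()
toy_dates['day_of_week'] = toy_dates['dates'].dt.day_name()

print(toy_dates)
```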

And there you go, we got it. Before we plot the heatmap for this analysis we have to convert the data into a pivot table. More about pivot tables can be found [here](https://en.wikipedia.org/wiki/Pivot_table); we will be using the `pivot_table` method in pandas.

In [None]:
rest_data = articles_df.pivot_table(index='month', columns='day_of_week', values='counts',  aggfunc='sum', fill_value=0)
rest_data = rest_data[['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']]

The second line of the code puts the days of the week in order, i.e. Monday, Tuesday, etc. Otherwise the columns would be sorted alphabetically.

In [None]:
rest_data
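Here is a minimal sketch of the same `pivot_table` call on made-up counts, to show how the long table is reshaped and how duplicates are summed. It also shows one possible way to put the months in calendar order with `reindex`, since the rows are sorted alphabetically just like the columns were; that step is my suggestion and is not part of the notebook.

```python
import pandas as pd

toy_articles = pd.DataFrame({
    'month': ['January', 'January', 'February', 'February'],
    'day_of_week': ['Monday', 'Tuesday', 'Monday', 'Monday'],
    'counts': [3, 1, 2, 4],
})

# One row per month, one column per day; duplicate (month, day) pairs are summed,
# and missing combinations are filled with 0
toy_pivot = toy_articles.pivot_table(
    index='month', columns='day_of_week', values='counts',
    aggfunc='sum', fill_value=0,
)

# Rows come out alphabetically (February before January); reindex restores calendar order
ordered_pivot = toy_pivot.reindex(['January', 'February'])

print(ordered_pivot)
```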

In [None]:
plt.figure(figsize=(12,6))
plt.title("when were the blogs posted?")
sns.heatmap(rest_data, cmap="Greens", linewidths=.5)

Wow, here is our heatmap. Amazing, isn't it?
From this heatmap we can say that most of the blogs were posted on Mondays, and that the busiest cell is Monday in October, with around 500 articles. If we pick the two days of the week with the highest number of published articles, we get

* Monday
* Thursday

So the next time you publish an article, try to do it on a Monday or a Thursday; hopefully it will be seen by a larger audience.
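The ranking of days can also be read off numerically instead of from the colours. The sketch below assumes a pivot table shaped like `rest_data` (months as rows, days as columns); the numbers in `toy_heatmap` are made up.

```python
import pandas as pd

toy_heatmap = pd.DataFrame(
    {'Monday': [5, 8], 'Tuesday': [3, 2], 'Thursday': [6, 4]},
    index=['January', 'February'],
)

# Sum over all months to get the total number of articles per day of the week
articles_per_day = toy_heatmap.sum(axis=0).sort_values(ascending=False)

# The two busiest days
top_two_days = articles_per_day.head(2).index.tolist()
print(top_two_days)  # ['Monday', 'Thursday']
```

On the real data, replacing `toy_heatmap` with `rest_data` should give the same kind of ranking.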

Now that we know which day is best for posting an article, let's try to understand how long the article should be.

We are going to plot the number of claps against the reading time of the article. The assumption is that a person will clap for an article only when they have read it completely and found it useful or entertaining.

In [None]:
plt.figure(figsize=(12,6))
plt.xlabel("Reading time in minutes")
plt.ylabel("number of claps")
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=medium_df_eda.reading_time, y=medium_df_eda.claps)

Yay, we got our graph!!
Along with the claps, the comments/responses are also an important factor in article writing because they help us understand the quality of a blog, i.e. whether it is well written, plagiarised, helpful, a conversation starter, etc. You can also do further sentiment analysis on them to understand whether the feedback is good or bad. Let's plot another scatterplot for the responses.

In [None]:
plt.figure(figsize=(19,6))
plt.xlabel("Reading time in minutes")
plt.ylabel("number of responses")
# Reading time on the x axis and responses on the y axis, matching the labels
sns.scatterplot(x=medium_df_eda.reading_time, y=medium_df_eda.responses)

From the above scatterplots we can say that

#### Articles with a reading time of about 5 - 10 minutes have a huge number of claps and some responses

For these claps we can assume that a person has read and understood the article and hence clapped for it. So we can say that the ideal reading time for an article is 5 - 10 minutes. If the article is too long, there is a possibility that the user will not read it because it takes a lot of their time, and hence will not comment on it or clap for it.
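One way to check this reading of the scatterplots numerically is to bucket the reading time and compare the typical number of claps per bucket. The sketch below uses made-up numbers and hypothetical bucket edges; `toy_articles`, `bins` and `labels` are assumptions for illustration and this is not a result from the dataset.

```python
import pandas as pd

toy_articles = pd.DataFrame({
    'reading_time': [3, 6, 7, 9, 12, 20],
    'claps': [10, 500, 300, 800, 50, 5],
})

# Buckets are right-inclusive: (0, 5], (5, 10], (10, 15], (15, 60]
bins = [0, 5, 10, 15, 60]
labels = ['0-5', '5-10', '10-15', '15+']
toy_articles['reading_bucket'] = pd.cut(toy_articles['reading_time'], bins=bins, labels=labels)

# Median is less sensitive to a few viral articles than the mean
median_claps = toy_articles.groupby('reading_bucket', observed=True)['claps'].median()
print(median_claps)
```

Running the same steps on `medium_df_eda` would show whether the 5 - 10 minute bucket really stands out.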

Images make articles engaging and attractive, and hence they are a necessary part of a blog. This is true at least for the cover image, because it can decide whether a user will click the link and look at the blog or not. Using the right image extension is also important because it affects the performance of the blog: if the webpage takes a long time to load because of its resources, i.e. images in this case, people are just going to move on.

So, let's check which image extensions are used in this dataframe of roughly 6.5k articles. A pie chart with percentages will be useful for this analysis, so let's plot that using the `value_counts` method.

In [None]:
image_counts = medium_df_eda['image'].value_counts()
image_counts.plot(kind='pie', figsize=(10, 9),  autopct='%1.1f%%')
# Use the value_counts index so the legend labels match the order of the wedges
plt.legend(image_counts.index)
plt.title("Type of images used")
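The percentages drawn on the pie can also be printed as plain numbers with `value_counts(normalize=True)`. A small sketch on made-up extensions (`toy_images` is an assumption for illustration):

```python
import pandas as pd

toy_images = pd.Series(['jpeg', 'jpeg', 'png', 'jpg'])

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
image_share = toy_images.value_counts(normalize=True).mul(100).round(1)
print(image_share)
```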

As we can see, most of the images are of type **jpeg**, about 50%, followed by jpg and png. There must be some reason behind this, so let's see what it is.

* JPEG - JPEG is a lossy compression method used to ensure the digital images being used are as small as possible and load quickly when someone wants to view them. Here are some important points about it
    * The file size of the image being compressed is permanently reduced by eliminating unnecessary (redundant) information from the image.
    * Image quality does suffer, though it’s often so slight the average site visitor can’t tell.
    
    
* JPG - Well, when it comes to .jpeg vs .jpg, the truth is there is no difference between the two except the number of characters.
    * The term JPG exists because of earlier versions of Windows operating systems. Specifically, the MS-DOS 8.3 and FAT-16 file systems had a maximum 3-letter limit when it came to file extensions, unlike UNIX-like operating systems such as Mac or Linux, which didn’t have this limit.
    
I read these points in [this article](https://kinsta.com/blog/jpg-vs-jpeg/), you can check it out if you want to learn more about it

Now, coming to PNGs:

* PNG - Portable Network Graphics (PNGs) are just as popular as JPEGs on websites. They also support millions of colors, although you’re much better off using PNGs for images that contain less color data. Otherwise, your image is going to be ‘heavier’ than the same image saved as a JPEG.

So we can conclude that

1. JPEG/JPG: This is an ideal image format for all types of photographs.
2. PNG: This format is perfect for screenshots and other types of imagery where there’s not a lot of color data.
3. GIF: If you want to show off animated graphics on your site, this is the best image format for you.


In [None]:
import jovian

In [None]:
jovian.commit(project=project_name)

## Asking and Answering Questions

We have gained some insights about the blogs in the dataset, and they also help us understand what our next article should look like. Let's ask some specific questions and try to answer them using dataframe operations and, where required, interesting visualizations.

In [None]:
medium_df_eda

### Q: What percent of articles have a subtitle after the heading, and is it necessary?

Titles play an essential role in determining whether a reader clicks on your story. Medium titles that are not formatted properly can render an article ineligible for curation. This is also true for the article subtitle if you choose to add these elements to your article.

In [None]:
percentage = medium_df_eda.subtitle.count()/len(medium_df_eda) * 100
print("About {} percent of articles have subtitles".format(percentage))
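The calculation above works because `count()` only counts non-missing values, while `len()` counts every row. Here is a minimal sketch on a made-up Series; `toy_subtitles` is an assumption for illustration, and the `notna().mean()` variant is an equivalent alternative.

```python
import numpy as np
import pandas as pd

toy_subtitles = pd.Series(['A subtitle', np.nan, 'Another one', np.nan])

# count() skips NaN, len() does not
percent_via_count = toy_subtitles.count() / len(toy_subtitles) * 100

# Equivalent: the mean of a boolean mask is the fraction of True values
percent_via_notna = toy_subtitles.notna().mean() * 100

# Formatting with one decimal keeps the printed number readable
print("About {:.1f} percent of articles have subtitles".format(percent_via_count))
```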

### Q: What time span does the dataset cover?

We can answer this using min and max functions

In [None]:
print("The data of blogs we have is from {} to {}".format(medium_df_eda.date.min(), medium_df_eda.date.max()))

So we have data for 2019, from 26th January to 30th December.

### Q: What is the average reading time of articles for each publication?

In [None]:
avg_reading_time_df = medium_df_eda[["publication","reading_time"]].groupby("publication").mean() 

In [None]:
print(avg_reading_time_df)
print("--------------------------------------")
print(medium_df_eda.publication.value_counts())

So from this we can see that most of the articles are published by The Startup and that their average reading time is around 6 minutes. This also backs up our analysis from the scatterplots, i.e. that the ideal reading time is 5 - 10 minutes.

### Q: Which article in each publication has the maximum number of claps?

In [None]:
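# 'max' as a string avoids the deprecation warning newer pandas versions raise for the builtin max
indexes = medium_df_eda.groupby(['publication'], sort=False)['claps'].transform('max') == medium_df_eda['claps']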

The code above gives us a boolean mask that is True for the rows with the maximum number of claps in their publication; we can then use it to select those rows from our dataset.

In [None]:
best_articles = medium_df_eda[indexes]
best_articles
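To make the `transform` trick easier to follow, here is a small sketch on a made-up dataframe. The names `toy_df`, `is_top` and `top_by_idxmax` are assumptions for illustration, and the `idxmax` variant is an alternative I am adding, not something the notebook uses.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'publication': ['A', 'A', 'B', 'B'],
    'title': ['a1', 'a2', 'b1', 'b2'],
    'claps': [5, 9, 7, 2],
})

# transform('max') broadcasts each group's maximum back to every row of that group,
# so comparing with the original column marks the top row(s) per publication
is_top = toy_df.groupby('publication')['claps'].transform('max') == toy_df['claps']
top_rows = toy_df[is_top]
print(top_rows['title'].tolist())  # ['a2', 'b1']

# Alternative: idxmax returns the index label of the first maximum in each group
top_by_idxmax = toy_df.loc[toy_df.groupby('publication')['claps'].idxmax()]
print(top_by_idxmax['title'].tolist())  # ['a2', 'b1']
```

One difference in design: the mask keeps all tied rows when several articles share the maximum, while `idxmax` keeps only the first one per group.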

In [None]:
for index, row in best_articles.iterrows():
    print("The article '{}' from '{}' publication has highest number of claps i.e. '{}'".format(row.title, row.publication, row.claps))
    print("\n")

So, these are the articles with the maximum number of claps in each publication.

### Q: What is the reading time of the article with the highest claps in each publication?

In [None]:
for index, row in best_articles.iterrows():
    print("The best article from {} has reading time of {} minutes".format(row.publication, row.reading_time))
    print("\n")


### Q: What is the average reading time of the articles with the highest claps across the publications?

In [None]:
best_articles.reading_time.mean()

This number again backs up our two previous analyses, i.e. 5 - 10 minutes. Hence we can say that reading time plays an important role in the success of an article.

### Q: What percent of the best blogs have subtitles?

In [None]:
percentage = best_articles.subtitle.count()/len(best_articles) * 100
print("About {} percent of articles have subtitles".format(percentage))

Wow, that's a huge number, so we can say that writing a subtitle is an important part of a successful blog.

Now that we have answers to our questions, let's move on to the conclusion.

In [None]:
import jovian

In [None]:
jovian.commit(project=project_name)

## Inferences and Conclusion

Here is the summary of analysis that we did from this dataset

* **"The Startup"** posted the highest number of articles in 2019, whereas **"Better Humans"** published the fewest. For this analysis to help in our blogging, we can further look at the articles published by The Startup and analyze which topics they write about or the nature of their articles, so that we find the topics that are in high demand.

* Mondays and Thursdays may be the best days to publish your article so that it gets more visibility and likes. Note that this is only one factor; for an article to become successful, the content, heading and other factors matter too.

* The ideal reading time for an article is around 5 - 10 minutes, i.e. the article you write should not be verbose and should be to the point, so that readers do not get bored and get the information they need quickly.

* JPEG/JPG: This is an ideal image format for all types of photographs. PNG: This format is perfect for screenshots and other types of imagery where there’s not a lot of color data.

* Writing a subtitle for your article may be helpful, because it helps the audience understand more about your post before they click on it. It can also be a deciding factor in whether a person clicks the link or not.

## References and Future work

Future Work:

The same analysis can be done on a dataset that has information on all the blogs over a year, and not only the blogs specific to these publications. This would allow a stronger analysis, which may back up our current findings and may also produce new ones. Similarly, some keywords can be extracted from the titles of the blogs, which could reveal the hot topics to write about.
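As a starting point for the keyword idea, here is a rough sketch that counts words in a few made-up titles. The titles, the tiny `stop_words` set and the variable names are all assumptions for illustration; on the real data the titles would come from the title column of `medium_df_eda`, and a proper stop-word list would be needed.

```python
from collections import Counter
import re

toy_titles = [
    'A Beginner Guide to Python',
    'Python for Data Science',
    'Data Science in the Real World',
]

# Very small hypothetical stop-word list, just for this example
stop_words = {'a', 'to', 'for', 'in', 'the'}

words = []
for title in toy_titles:
    # Lower-case the title and keep only alphabetic words
    for word in re.findall(r'[a-z]+', title.lower()):
        if word not in stop_words:
            words.append(word)

# The three most frequent remaining words
top_keywords = Counter(words).most_common(3)
print(top_keywords)
```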

References:

- Dataset: https://www.kaggle.com/dorianlazar/medium-articles-dataset?select=medium_data.csv
- Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
- Matplotlib user guide: https://matplotlib.org/3.3.1/users/index.html
- Seaborn user guide & tutorial: https://seaborn.pydata.org/tutorial.html


In [None]:
import jovian

In [None]:
jovian.commit(project=project_name)