## Importing Modules and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
df = pd.read_csv('../input/windows-store/msft.csv')

## Initial Review of Data

In [None]:
df.head()

In [None]:
df.describe(include='all')

In [None]:
df.info()

Out of 5322 rows, one row appears to be almost entirely null.

In [None]:
df[df['Name'].isna()]

This appears to be a blank record, so we can drop it.

In [None]:
df.drop(5321, axis=0, inplace = True)
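Dropping by the hard-coded index label 5321 works for this file, but it would remove the wrong row if the data ever changed. A minimal alternative sketch, assuming the blank record is the only row with a missing `Name`; the small `demo` frame is made-up illustration data, not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: the last row is blank
demo = pd.DataFrame({
    "Name": ["App A", "App B", np.nan],
    "Rating": [4.0, 3.5, np.nan],
})

# Drop every row whose Name is missing, whatever its index label is
cleaned = demo.dropna(subset=["Name"])
print(len(demo), "->", len(cleaned))
```

On the real data this would correspond to `df = df.dropna(subset=['Name'])`.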

## Checking for duplicate entries in data

In [None]:
df["Name"].value_counts()[df["Name"].value_counts() > 1]

Three names occur more than once in the dataset.

## Checking Records with duplicate names

In [None]:
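# Collect the names that appear more than once, then show all rows with those names
name_counts = df["Name"].value_counts()
duplicate_names = name_counts[name_counts > 1].index
df[df["Name"].isin(duplicate_names)].sort_values(by="Name")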

The name 'http://microsoft.com' seems to be a placeholder for applications with missing names. This is not a problem, since the name is largely irrelevant to this analysis.

The entries for the applications named 'Multilingual Translator' are identical except for the columns 'No of people Rated' and 'Price'. These could be a free and a paid version from the same developer, so this is not an issue.

Data contains no real duplicates.
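The two checks above (repeated names versus truly identical rows) could also be sketched with `DataFrame.duplicated`. The `demo` frame and its values are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical data: one name appears twice, but the rows differ in Price
demo = pd.DataFrame({
    "Name": ["Translator", "Translator", "Notes", "Maps"],
    "Price": ["Free", "9.99", "Free", "Free"],
})

# keep=False flags every row whose Name appears more than once
same_name = demo[demo.duplicated(subset="Name", keep=False)].sort_values(by="Name")

# With no subset, duplicated() compares all columns, i.e. exact duplicate rows
exact_duplicates = int(demo.duplicated().sum())

print(same_name)
print("Exact duplicate rows:", exact_duplicates)
```

A count of zero exact duplicates alongside a non-empty `same_name` matches the situation described here: repeated names, but no real duplicates.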

## Categories

In [None]:
sns.set(rc={'figure.figsize':(12,5)})
ax = sns.countplot(x="Category", data=df.sort_values(by='Category'), order=df.Category.value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()

Music, Books, and Business are the categories with the most applications.

## Prices

### Percent of Free applications

In [None]:
len(df[df['Price']=='Free'])/len(df['Price'])*100

Since 97% of the applications are free, we can replace all other prices with a single value, 'Paid'.

In [None]:
df.loc[~df["Price"].isin(['Free']), "Price"] = "Paid"
sns.set(rc={'figure.figsize':(12,5)})
ax = sns.countplot(x="Category", hue="Price", data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()

Only the categories 'Books', 'Business', and 'Developer Tools' have paid applications in the Windows store.
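Reading this off a chart can be error-prone when the paid bars are tiny. One way to confirm it numerically is a cross-tabulation of category against price group. This is a sketch on made-up data, assuming `Price` has already been collapsed to 'Free' and 'Paid':

```python
import pandas as pd

# Hypothetical data with Price already reduced to Free / Paid
demo = pd.DataFrame({
    "Category": ["Books", "Books", "Music", "Business", "Music"],
    "Price": ["Free", "Paid", "Free", "Paid", "Free"],
})

# Rows are categories, columns are price groups, cells are app counts
counts = pd.crosstab(demo["Category"], demo["Price"])

# Categories that contain at least one paid app
paid_categories = counts.index[counts["Paid"] > 0].tolist()

print(counts)
print(paid_categories)
```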

## Ratings

In [None]:
sns.set(rc={'figure.figsize':(12,5)})
ax = sns.countplot(x="Rating", data=df.sort_values(by='Rating'), order=df.Rating.value_counts().index)
#ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()

In [None]:
print("Mean Rating for Free Apps:", round(df[df['Price']=='Free'].Rating.mean(),2))
print("Mean Rating for Paid Apps:", round(df[df['Price']=='Paid'].Rating.mean(),2))
print("Overall Mean Rating:", round(df.Rating.mean(), 2))

Mean rating for 'Paid' applications is considerably lower than the overall mean rating.
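Because only about 3% of the applications are paid, the paid mean rests on a much smaller sample than the free mean. A small sketch (on hypothetical ratings, not the real data) that reports the group size next to each mean:

```python
import pandas as pd

# Hypothetical ratings for illustration only
demo = pd.DataFrame({
    "Price": ["Free", "Free", "Paid", "Paid"],
    "Rating": [4.0, 5.0, 3.0, 2.0],
})

# Mean rating and number of apps per price group
summary = demo.groupby("Price")["Rating"].agg(["mean", "count"]).round(2)
print(summary)
```

On the real data the same call would be `df.groupby('Price')['Rating'].agg(['mean', 'count'])`.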

## Application Launch Stats

In [None]:
df['Date'].dtype

First, we convert the Date column to the correct data type so that month and year columns can be derived from it.

In [None]:
df["Date"] = pd.to_datetime(df["Date"])

Next, we split the month and year out of Date into separate columns for analysis.

In [None]:
months=['January', 'February', 'March', 'April', 'May', 'June', 'July','August', 'September', 'October', 'November', 'December']
df['Launch Month']=[months[i.month-1] for i in df["Date"]]
df['Launch Year']=[i.year for i in df["Date"]]
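The list comprehensions above could also be written with the pandas `.dt` accessor, which avoids the manual lookup into `months`. A minimal sketch on made-up dates, assuming the default (English) locale for `month_name()`:

```python
import pandas as pd

# Hypothetical launch dates for illustration only
demo = pd.DataFrame({"Date": ["2016-03-15", "2019-11-02", "2020-06-30"]})
demo["Date"] = pd.to_datetime(demo["Date"])

# Vectorised extraction of the month name and the year
demo["Launch Month"] = demo["Date"].dt.month_name()
demo["Launch Year"] = demo["Date"].dt.year
print(demo)
```

The `months` list is still useful afterwards for putting the month axis in calendar order.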

In [None]:
sns.set(rc={'figure.figsize':(12,5)})
ax = sns.countplot(x="Launch Year", data=df.sort_values(by='Launch Year'))
# Set the title after creating the plot so it applies to this axes, not the previous one
ax.set_title('Applications Launched Each Year')
# Mean number of launches per year, drawn as a horizontal reference line
yearly_mean = df['Launch Year'].value_counts().mean()
ax.axhline(yearly_mean, color='green', linewidth=2)
ax.margins(0.05)
ax.annotate('Mean: {:0.2f}'.format(yearly_mean), xy=(10.7, yearly_mean+40),
            horizontalalignment='right', verticalalignment='center',
            )
#ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()

2016 saw the most application launches in the past decade.

2019 saw the lowest number of application launches since 2012.

Data for 2020 is only available through June.

In [None]:
sns.set(rc={'figure.figsize':(12,5)})
ax = sns.countplot(x="Launch Month", data=df.sort_values(by='Launch Month'), order=months)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.axhline(df['Launch Month'].value_counts().mean(), color='green', linewidth=2)
ax.margins(0.05)
ax.annotate('Mean: {:0.2f}'.format(df['Launch Month'].value_counts().mean()), xy=(11.5, df['Launch Month'].value_counts().mean()+20),
            horizontalalignment='right', verticalalignment='center',
            )

plt.show()

## Thank you for viewing my notebook.
## This is my first notebook on Kaggle so any thoughts would be appreciated.