**Inaugural Speeches** of American Presidents are followed not only by people in America, but by people all over the world. These speeches convey to the world what America is going to be like for the next 4 years. 

To make the analysis more interesting, I have added the Political Party of each speaker.

**If this helped you, some upvotes  would be very much appreciated - that's where I get my motivation.**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
%matplotlib inline

from subprocess import check_output

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud,STOPWORDS

pal = sns.color_palette()

# Input data files are available in the "../input/" directory.
# List the files in the input directory
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

Let's look at the data!

In [None]:
df = pd.read_csv("../input/inaug_speeches.csv",encoding="iso-8859-1")
df = df.iloc[:,1:]
df.head()

In [None]:
print("Total number of speeches :",len(df))

In [None]:
df.isnull().any()

The data looks good. Let's see what other features we could add to it before analyzing.

In [None]:
#Length of each speech (note: len(x) on a string counts characters, not words)
df["word_count"] = df["text"].apply(lambda x : len(x))
#Number of unique words in each speech
df["unique_word"] = df["text"].apply(lambda x : len(set(x.lower().split()) ) )
#Number of unique words ratio in each speech
df["unique_word_ratio"] = df.apply(lambda x : x["unique_word"]/x["word_count"] ,axis=1)
#Extracting year alone from the Date column
df["year"] = df["Date"].apply(lambda x : int(x.split(",")[2])  if len(x.split(","))==3 else int(x.split(",")[1]) )
df.head()
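
One thing to keep in mind: `len(x)` on a string returns its number of characters, so a whitespace split is needed to count words. A minimal sketch on a made-up sentence (the `sample` text and the helper name `text_features` are illustrative and not part of the dataset):

```python
def text_features(text):
    """Return character count, word count, unique word count and unique ratio."""
    words = text.lower().split()          # naive whitespace tokenization
    unique_words = set(words)
    return {
        "char_count": len(text),          # what len() on a string measures
        "word_count": len(words),
        "unique_word": len(unique_words),
        "unique_word_ratio": len(unique_words) / len(words) if words else 0.0,
    }

sample = "We the people of the United States"   # made-up example text
features = text_features(sample)
print(features)
```

Punctuation stays attached to words with this tokenization, so "people," and "people" would be counted as different words.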

It would be interesting to add the party of each American President.

The party affiliations were retrieved from this [link][1].


  [1]: http://www.enchantedlearning.com/history/us/pres/list.shtml

In [None]:
party_dict = {
    "Federalist":["George Washington","John Adams"],
    "Democratic-Republican":['Thomas Jefferson',
       'James Madison', 'James Monroe', 'John Quincy Adams'],
    "Democrat":['Andrew Jackson', 'Martin Van Buren',
                'James Knox Polk','Franklin Pierce',
                'James Buchanan','Grover Cleveland',
               'Woodrow Wilson','Franklin D. Roosevelt',
       'Harry S. Truman','John F. Kennedy',
       'Lyndon Baines Johnson','Jimmy Carter',
                'Bill Clinton', 'Barack Obama'
               ],
    "Whig":["William Henry Harrison",'Zachary Taylor'],
    "Republican":['Abraham Lincoln', 'Ulysses S. Grant',
       'Rutherford B. Hayes', 'James A. Garfield','Benjamin Harrison', 'William McKinley', 'Theodore Roosevelt',
       'William Howard Taft','Warren G. Harding',
       'Calvin Coolidge', 'Herbert Hoover',
                  'Dwight D. Eisenhower',
                  'Richard Milhous Nixon',
                  'Ronald Reagan', 'George Bush',
                  'George W. Bush','Donald J. Trump'
                 ]
    
}
def get_party(name):
    for party,names in party_dict.items():
        if name in names:
            return party
df["party"] = df["Name"].apply(lambda x : get_party(x))
g= sns.countplot(y="party",data=df)
plt.title("Number of Speeches per Political Party")
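
The lookup above could also be written as a flat name-to-party mapping used with `Series.map`, which makes unmatched names easy to spot. A small sketch with a shortened dictionary and toy names (`toy_party_dict`, `toy_df` and "Unknown Speaker" are made up for the example):

```python
import pandas as pd

toy_party_dict = {
    "Federalist": ["George Washington", "John Adams"],
    "Whig": ["William Henry Harrison", "Zachary Taylor"],
}

# Invert {party: [names]} into {name: party} for direct lookups
name_to_party = {name: party
                 for party, names in toy_party_dict.items()
                 for name in names}

toy_df = pd.DataFrame({"Name": ["John Adams", "Zachary Taylor", "Unknown Speaker"]})
toy_df["party"] = toy_df["Name"].map(name_to_party)   # NaN when the name is missing

# Names without a party, e.g. because of a spelling mismatch
unmatched = toy_df.loc[toy_df["party"].isnull(), "Name"].tolist()
print(unmatched)
```

Checking `unmatched` before plotting helps catch speakers whose names are spelled differently in the data than in the dictionary.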

Let's start the analysis!

## Analysis with Word count feature ##

In [None]:
ax= sns.boxplot(df["word_count"],orient='v')
plt.title("Box plot of Word Count")

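It shows that 50% of the speeches have a `word_count` between *8000* and *18000*. We can also notice some outliers with a `word_count` of nearly 50000. Note that `word_count` is computed with `len(x)` on the raw text, so these figures count characters rather than words.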
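
The box of a box plot spans the first to the third quartile, which is where the middle 50% of the values lie, and points more than 1.5 times the interquartile range away from the box are usually drawn as outliers. A rough sketch of the same numbers computed directly, using made-up values in place of the real `word_count` column:

```python
import pandas as pd

toy_counts = pd.Series([800, 8000, 10000, 12000, 14000, 18000, 50000])  # made-up values

q1 = toy_counts.quantile(0.25)
q3 = toy_counts.quantile(0.75)
iqr = q3 - q1

# Common box plot convention: values beyond 1.5 * IQR from the box are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy_counts[(toy_counts < lower) | (toy_counts > upper)]

print("Middle 50% lies between", q1, "and", q3)
print(outliers.tolist())
```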

Let's check the details about extreme speeches with respect to word count.

In [None]:
print("Average number of words per speech : ",df["word_count"].mean())
print()
print("Shortest Speech : ")
print(df.loc[df["word_count"].idxmin()])
print()
print("Longest Speech : ")
print(df.loc[df["word_count"].idxmax()])

## Shortest Speech

<table  border="0">

<tr>
<td rowspan=4><img src="http://cdn.playbuzz.com/cdn/283d32e4-e028-421f-b36c-0d17a142cf84/03b401ee-89e7-4abd-bfd0-df61ac12d9a1.jpg" alt="George Washington" style="height:100px;width: 100px;"/>
</td>
<td> Name</td> 
<td>George Washington</td>
</tr>

<tr>
<td> Year </td> 
<td>1793</td>
</tr>
<tr>
<td> Word Count </td>
<td>819</td>
</tr>

<tr>
<td> Unique Word </td>
<td>90</td>
</tr>



</table>



## Longest Speech

<table border="0">

<tr>
<td rowspan=4><img src="http://media1.britannica.com/eb-media/43/126143-004-0383748E.jpg" alt="William Henry Harrison" style="height:100px;width: 100px;"/>  
</td>
<td>Name</td>
<td>William Henry Harrison</td>
</tr>

<tr>
<td> Year </td> 
<td>1841</td>
</tr>

<tr>
<td> Word Count </td>
<td>49885</td>
</tr>

<tr>
<td> Unique Word </td>
<td>2152</td>
</tr>
</table>

Let's study the word count feature along with the party.

In [None]:
wc_mean = df.groupby("party")["word_count"].mean().reset_index()
g = sns.catplot(x="party",y="word_count",kind="bar",data=wc_mean)  # catplot replaces the older factorplot
g.set_xticklabels(rotation=90)

The Whig Party has the highest average `word_count` in its inaugural speeches, and the reason behind it is the speech of William H. Harrison (`word_count` of 49885).

Between Republicans and Democrats, Republicans have used more words in their speeches.
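
A per-party mean can rest on very few speeches, so reporting the group size and the maximum next to the mean is one way to see how much a single long speech drives the result. A sketch on made-up numbers (the values below are not taken from the dataset):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "party": ["Whig", "Whig", "Republican", "Republican", "Republican"],
    "word_count": [50000, 6000, 12000, 13000, 14000],   # made-up values
})

# Group size, mean and maximum side by side for each party
summary = toy_df.groupby("party")["word_count"].agg(["count", "mean", "max"])
print(summary)
```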

Let's see how the number of words in inaugural speeches varies from year to year.

In [None]:
g = sns.catplot(x="year",y="word_count",hue="party",data=df,kind="bar",height=5,aspect=2,legend_out=False)
g.set_xticklabels(rotation=90)
plt.title("Year wise Speech Length")

An interesting observation is that the speech length of both Republicans and Democrats appears to have decreased significantly over the last 10 decades.
Let's verify that.

In [None]:
fact1 = []
val1 = df[(df["party"] == "Republican") & (df["year"]< 1930)]["word_count"].mean()
val2 = df[(df["party"] == "Republican") & (df["year"]> 1930)]["word_count"].mean()
val3 = df[(df["party"] == "Democrat") & (df["year"]< 1930)]["word_count"].mean()
val4 = df[(df["party"] == "Democrat") & (df["year"]> 1930)]["word_count"].mean()
fact1.append(["before 1930",val1,"Republican"])
fact1.append(["after 1930",val2,"Republican"])
fact1.append(["before 1930",val3,"Democrat"])
fact1.append(["after 1930",val4,"Democrat"])
fact1_df = pd.DataFrame(fact1,columns=["period","value","party"])
overall_avg = df["word_count"].mean()
g = sns.catplot(x="period",y="value",col="party",data=fact1_df,kind="bar")
#plt.plot([overall_avg,overall_avg],'k--')
#plt.title("Republicans Word Count")

The above graph shows the decline in the average speech length of Republicans and Democrats after 1930.
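
The same comparison can also be expressed with a derived `period` column and a single `groupby`, which avoids computing each mean by hand. A sketch on a toy frame (the values are made up; only the column names follow the notebook):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "party": ["Republican", "Republican", "Democrat", "Democrat"],
    "year": [1901, 1981, 1913, 1993],
    "word_count": [20000, 12000, 18000, 10000],   # made-up values
})

# Label each speech by period; no inauguration took place in 1930 itself
toy_df["period"] = np.where(toy_df["year"] < 1930, "before 1930", "after 1930")

# One mean per (party, period) pair, in the same long format as fact1_df
period_means = (toy_df.groupby(["party", "period"])["word_count"]
                      .mean()
                      .reset_index(name="value"))
print(period_means)
```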

## Analysis on Unique words ##

In [None]:
ax= sns.boxplot(df["unique_word"],orient='v')
plt.title("Box Plot of Unique Words")

This shows that 50% of the speeches have between 600 and 1100 unique words.

As we noticed earlier, the word count feature has outliers, so the speeches with the fewest and the most unique words are the same ones as before: those by George Washington and William Henry Harrison, respectively.

So, let's look for the peculiar speeches with respect to the unique word ratio instead.

In [None]:
print("Average unique word ratio per speech : ",df["unique_word_ratio"].mean())
print()
print("Speech with lowest unique word ratio :")
print(df.loc[df["unique_word_ratio"].idxmin()])
print()
print("Speech with highest unique word ratio :")
print(df.loc[df["unique_word_ratio"].idxmax()])

It's no surprise that the less we speak, the higher our unique word ratio tends to be.
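
This effect is easy to reproduce: common words repeat as a text grows, so the share of unique words tends to fall with length. A small sketch on a made-up sentence, measuring the ratio on growing prefixes (`unique_ratio` is an illustrative helper that divides by the number of words):

```python
def unique_ratio(words):
    """Share of distinct words among all words in the list."""
    return len(set(words)) / len(words)

# Made-up text in which common words repeat
text = "the people and the nation and the union and the people"
words = text.split()

# The ratio shrinks as more of the text is included
for n in (3, 6, 11):
    print(n, round(unique_ratio(words[:n]), 2))
```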

Let's check with parties!

In [None]:
uni_mean = df.groupby("party")["unique_word_ratio"].mean().reset_index()
g = sns.catplot(x="party",y="unique_word_ratio",kind="bar",data=uni_mean)
g.set_xticklabels(rotation=90)
plt.title("Unique Word Ratio per Party")

The rule of *speak less - more unique words* still holds true.
Federalists have the highest unique word ratio, as they spoke the fewest words compared to the other parties.

In [None]:
g = sns.catplot(x="year",y="unique_word_ratio",hue="party",data=df,kind="bar",height=5,aspect=2,legend_out=False)
g.set_xticklabels(rotation=90)
plt.title("Year wise Unique Word Ratio")

We can see that there is no big difference in the unique word ratio from year to year.