# IST 652 Final Project - A Pantheon Exploration

# Introduction

Recorded human history spans only about 5,000 years. Understanding the makeup of the Pantheon gives readers, young and old, a view of humanity's global, scientific, and cultural development. The word *Pantheon* describes a group of particularly respected, famous, or important people. Over the course of human history, thousands of such individuals have made an impact on society. Since the information age began in the latter part of the 20th century, people all over the world have gained access to the same biographies of these individuals through Wikipedia, in multiple languages. What was once information of limited accessibility and supply, shelved in remote libraries, is now conveniently available through this medium.

Credit goes to the Macro Connections group at the Massachusetts Institute of Technology Media Lab and their [Pantheon Project](https://www.kaggle.com/mit/pantheon-project). The group made a Pantheon index available to the general public and created a popularity index as well. One "barrier to entry" to this content is the number of languages in which each article is available, and this count is a key measure in determining the popularity of individuals in the Pantheon. "The simpler of the two measures, which we denote as L, is the number of different Wikipedia language editions that have an article about a historical character. The more sophisticated measure, which we name the Historical Popularity Index (HPI) corrects L by adding information on the age of the historical character, the concentration of page views among different languages, the coefficient of variation in page views, and the number of page views in languages other than English." [https://www.kaggle.com/mit/pantheon-project](https://www.kaggle.com/mit/pantheon-project)

The significance of historical figures may be debatable, but this report seeks to apply an objective measure of popularity to better understand how biographies of the Pantheon are being consumed today. From Aristotle to Benjamin Franklin, Jesus Christ to Al Pacino, historical figures and their attributes will be measured with web analytics so that the reader may gain insight into the following:

Research questions:

* Which historic characters are the most popular?
* When did they live and where are they from?
* What factors could have generated their popularity?
* Are there any observable trends in the categorical data provided?
* What associations exist between the variables in the data?
* What clusters and groupings are in the data? How do groups compare in popularity?

# Analysis

Key analysis methods used in this report:

* Data cleaning
* Sorting and subsetting the data
* Line and bar plots
* Multiple regression
## About the Data

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
# Loading data from the .csv file
dataFileName = "data/database.csv"
if os.path.isfile(dataFileName):
    dfDirtyData = pd.read_csv(dataFileName, sep=",", header=0)
else:
    # Report the missing file and the directory the relative path is resolved against
    print("File not found:", dataFileName, "in", os.getcwd())


In [0]:
# Showing information about the dataframe
dfDirtyData.info()

In [0]:
# Creating a variable to hold the fields to display, since not all columns will fit
columnToDisplayDirtyData = ['full_name', 'sex', 'birth_year', 'country', 'occupation','historical_popularity_index']

In [0]:
# Showing a single row of record
dfDirtyData.loc[1522]

In [0]:
# Showing the first 5 rows of the data in the data frame
dfDirtyData.loc[:,columnToDisplayDirtyData].head(5)

In [0]:
# Showing the last 5 rows of the data in the data frame
dfDirtyData.loc[:,columnToDisplayDirtyData].tail(5)

## Data Cleaning

In [0]:
# Creating a new copy of a dataframe
dfCleanData = dfDirtyData.copy()

In [0]:
# Renaming historical_popularity_index to popularity (short name)
dfCleanData.rename(columns={'historical_popularity_index': 'popularity'}, inplace=True)

In [0]:
# Finding the positions of rows whose value in a column cannot be read as an integer
# (for example alphanumeric text or missing values).
def findStringValueIndex(fieldName):
    badRowIndex = [] # initialize list of problematic rows
    for idx, value in enumerate(dfCleanData[fieldName]):
        try:
            int(value)
        except (ValueError, TypeError, OverflowError):
            badRowIndex.append(idx)
    return badRowIndex

In [0]:
# List of columns to be converted into numeric data type
columnToConvert = ["article_languages", "birth_year", "latitude", "longitude", "page_views","average_views", "popularity"]
# Replacing the non-numeric values with 0 before converting to numeric.
# Example for a single column:
#           dfCleanData.loc[findStringValueIndex('latitude'), 'latitude'] = 0
# The following loop does the same for every column in the list
for col in columnToConvert:
    indx = findStringValueIndex(col)
    if len(indx) > 0:
        dfCleanData.loc[indx, col] = 0
# Converting the columns above to numeric data type
dfCleanData[columnToConvert] = dfCleanData[columnToConvert].apply(pd.to_numeric)
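
The same clean-up can also be expressed without collecting row positions first. The following is a minimal sketch, not the notebook's own method: it assumes a made-up frame (`dfToy` and its values are illustrative, not taken from the Pantheon data) and uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN, followed by `fillna(0)`.

```python
import pandas as pd

# Hypothetical toy frame standing in for two of the Pantheon columns
dfToy = pd.DataFrame({
    "birth_year": ["1756", "Unknown", "-384"],
    "latitude": ["47.8", None, "40.5"],
})

columnToConvert = ["birth_year", "latitude"]

# errors="coerce" turns anything unparseable into NaN;
# fillna(0) then mirrors the zero-replacement used in the loop above
dfToyNumeric = (
    dfToy[columnToConvert]
    .apply(pd.to_numeric, errors="coerce")
    .fillna(0)
)

print(dfToyNumeric)
```

One difference worth noting: `int()` rejects decimal strings such as `"47.8"`, so the loop would zero them, while `pd.to_numeric` parses them as floats.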

In [0]:
# Looping through columns in the dataframe.
# Replacing missing values (NaN) in 'object' columns with the string 'NA'
for column in dfCleanData.columns:
    if dfCleanData[column].dtype == 'object':
        dfCleanData[column] = dfCleanData[column].fillna(value="NA")
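
As a possible shorthand for the loop above, `select_dtypes` can pick out the text columns in one step. This is a sketch on a hypothetical frame (`dfToy` and its values are invented for illustration); the columns are given explicit dtypes so the example behaves the same regardless of how pandas infers string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one text column and one numeric column, each with a gap
dfToy = pd.DataFrame({
    "country": pd.Series(["Greece", np.nan, "Italy"], dtype="object"),
    "birth_year": pd.Series([-384.0, 1706.0, np.nan], dtype="float64"),
})

# Select only the 'object' columns; numeric NaN values are left untouched
objectColumns = dfToy.select_dtypes(include="object").columns
dfToy[objectColumns] = dfToy[objectColumns].fillna("NA")

print(dfToy)
```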

In [0]:
# Showing information about the dataframe
dfCleanData.info()

In [0]:
# Creating a variable to hold the fields to display, since not all columns will fit
columnToDisplayCleanData = ['full_name', 'sex', 'birth_year', 'country', 'occupation','popularity','longitude','latitude','page_views','average_views']

In [0]:
# Showing the first 5 rows of the data in the dataset
dfCleanData.loc[:,columnToDisplayCleanData].head(5)

In [0]:
# Showing the last 5 rows of the data in the dataset
dfCleanData.loc[:,columnToDisplayCleanData].tail(5)

## Exploration

In [0]:
# Which historic characters are in the top ten most popular?
dfTopTenMostPopularPeople = dfCleanData.sort_values(by = ['popularity'], ascending = False)
dfTopTenMostPopularPeople = dfTopTenMostPopularPeople.head(10)
dfTopTenMostPopularPeople.loc[:,columnToDisplayCleanData]
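
The sort-then-`head` pattern above can also be written with `nlargest` and `nsmallest`. The sketch below uses a hypothetical four-row frame (`dfToy`, with invented names and scores) purely to show the calls.

```python
import pandas as pd

# Hypothetical toy frame; names and scores are made up for illustration
dfToy = pd.DataFrame({
    "full_name": ["Person A", "Person B", "Person C", "Person D"],
    "popularity": [20.5, 31.9, 27.0, 18.2],
})

# Rows with the two highest and the two lowest popularity values
dfTopTwo = dfToy.nlargest(2, "popularity")
dfBottomTwo = dfToy.nsmallest(2, "popularity")

print(dfTopTwo)
print(dfBottomTwo)
```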

In [0]:
# Counting the top ten most popular people by country
dfTotalTop10ByCountry = pd.DataFrame(dfTopTenMostPopularPeople.groupby(['country'])['article_id'].count())
dfTotalTop10ByCountry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalTop10ByCountry

In [0]:
plt.plot(dfTotalTop10ByCountry)

In [0]:
# Which historic characters are in the top ten least popular?
dfTopTenLeastPopularPeople = dfCleanData.sort_values(by = ['popularity'], ascending = True)
dfTopTenLeastPopularPeople = dfTopTenLeastPopularPeople.head(10)
dfTopTenLeastPopularPeople.loc[:,columnToDisplayCleanData]

In [0]:
dfTotalTop10LByContinentByCountry = pd.DataFrame(dfTopTenLeastPopularPeople.groupby([ 'country','sex'])['article_id'].count())
dfTotalTop10LByContinentByCountry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalTop10LByContinentByCountry

In [0]:
# Bar chart of the counts; the (country, sex) MultiIndex supplies the bar labels
dfTotalTop10LByContinentByCountry.plot(kind='bar')

In [0]:
# Getting the count by industry and occupation
dfTotalByCountryByIndustry = pd.DataFrame(dfCleanData.groupby(['industry','occupation'])['article_id'].count())
# Renaming the default column name to a meaningful name
dfTotalByCountryByIndustry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalByCountryByIndustry.sort_values(by = ['total_count'], ascending = False).head(20)

In [0]:
# Getting the count by continent and by country
dfTotalByContinentByCountry = pd.DataFrame(dfCleanData.groupby(['continent', 'country'])['article_id'].count())
# Renaming the default column name to a meaningful name
dfTotalByContinentByCountry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalByContinentByCountry.sort_values(by = ['total_count'], ascending = False).head(10)

In [0]:
# Getting the count by continent
dfTotalByContinent = pd.DataFrame(dfCleanData.groupby(['continent'])['article_id'].count())
# Renaming the default column name to a meaningful name
dfTotalByContinent.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalByContinent.sort_values(by = ['total_count'], ascending = False)
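
The group, count, rename, and sort steps repeat across the cells above, so they could be folded into one helper. The sketch below is one way to do that; `countBy` is a hypothetical name and `dfToy` is invented data. It uses `size()`, which counts rows per group, so it matches counting `article_id` only if that column has no missing values, and it returns the group keys as ordinary columns rather than as the index.

```python
import pandas as pd

def countBy(df, groupColumns):
    """Count rows per group and return them sorted from largest to smallest."""
    return (
        df.groupby(groupColumns)
          .size()
          .reset_index(name="total_count")
          .sort_values(by="total_count", ascending=False)
    )

# Hypothetical toy frame standing in for the cleaned Pantheon data
dfToy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Europe"],
    "country": ["Italy", "Italy", "China", "Greece"],
})

dfToyByContinent = countBy(dfToy, ["continent"])
dfToyByContinentByCountry = countBy(dfToy, ["continent", "country"])

print(dfToyByContinent)
print(dfToyByContinentByCountry)
```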

## Modeling
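
The analysis methods listed earlier include multiple regression. As a rough illustration of what such a fit involves, the sketch below runs an ordinary least squares fit with NumPy on synthetic data. Everything in it is an assumption made for illustration: the predictors (`articleLanguages`, `logPageViews`), the coefficients used to generate the response, and the noise level. None of the numbers are results from the Pantheon data.

```python
import numpy as np

# Synthetic data only: generated from known coefficients so the fit can be checked
rng = np.random.default_rng(0)
n = 200
articleLanguages = rng.integers(25, 200, size=n).astype(float)
logPageViews = rng.normal(15.0, 1.5, size=n)
noise = rng.normal(0.0, 0.5, size=n)
popularity = 10.0 + 0.05 * articleLanguages + 0.8 * logPageViews + noise

# Design matrix with an intercept column followed by the two predictors
X = np.column_stack([np.ones(n), articleLanguages, logPageViews])

# Ordinary least squares fit of popularity on both predictors
coef, _, _, _ = np.linalg.lstsq(X, popularity, rcond=None)

# Coefficient of determination (R squared) of the fitted model
predicted = X @ coef
ssRes = np.sum((popularity - predicted) ** 2)
ssTot = np.sum((popularity - popularity.mean()) ** 2)
rSquared = 1.0 - ssRes / ssTot

print("intercept, slopes:", coef)
print("R squared:", rSquared)
```

On the real data, the columns of `dfCleanData` would take the place of the synthetic arrays, and the recovered coefficients would describe how popularity varies with each predictor while the other is held fixed.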

# Results

# Conclusion