The team at FiveThirtyEight wanted to dissect the deaths of the Avengers in the comics over the years. The writers were known to kill off and revive many of the superheroes, so the team was curious to see what data they could grab from the Marvel Wikia site, a fan-driven community site, to explore further. To learn how they collected their data, which is available on their GitHub repo, read the writeup they published on their site.

While the FiveThirtyEight team has done a wonderful job acquiring this data, the data still has some inconsistencies. Your mission, if you choose to accept it, is to clean up their dataset so it can be more useful for analysis in Pandas. First things first: read our dataset into Pandas as a DataFrame and preview the first 5 rows to get a better sense of our data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
avengers = pd.read_csv("/kaggle/input/avengers/avengers.csv", encoding = "latin-1")

In [None]:
avengers.head()

In [None]:
avengers.columns

In [None]:
avengers.dtypes

In [None]:
avengers['Year'].hist()

In [None]:
avengers["Year"].describe()


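We only want to keep the Avengers who were introduced in 1960 or later. Filter out all Avengers introduced before 1960 and keep only the ones added in 1960 or later. The original exercise stores the result in true_avengers; this notebook overwrites avengers with the filtered rows instead.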

In [None]:
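# Keep rows with Year >= 1960; .copy() avoids SettingWithCopyWarning when columns are added later
avengers = avengers[avengers["Year"] >= 1960].copy()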
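
A minimal sketch of the same boolean-mask filter on a toy frame. The names `toy`, `mask`, and `recent` and all the values are made up for illustration; none of them come from the dataset. It shows that `>=` keeps the 1960 boundary row and that a missing Year compares as False, so such rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from avengers.csv
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [1900.0, 1960.0, 2013.0, np.nan],
})

# Boolean mask: True where Year is 1960 or later; NaN compares as False
mask = toy["Year"] >= 1960

# .copy() gives an independent frame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning
recent = toy[mask].copy()
print(recent["Name"].tolist())
```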

In [None]:
avengers['Year'].hist()
plt.show()

In [None]:
avengers.head()

# Consolidating Deaths 

We are interested in the number of total deaths each character experienced and we'd like a field containing that distilled information. Right now, there are 5 fields (Death1 to Death5) that each contain a binary value representing if a superhero experienced that death or not. For example, a superhero can experience Death1, then Death2, etc. until they were no longer brought back to life by the writers.

We'd like to coalesce that information into just one field so we can do numerical analysis more easily.

Create a new column, Deaths, that contains the number of times each superhero died. The possible values for each death field are YES, NO, and the Pandas NaN value used to represent missing data. Keep all of the original columns (including Death1 to Death5) and add the new Deaths column to the filtered DataFrame (avengers in this notebook).

In [None]:
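def deaths(series):
    """Count how many of the Death1..Death5 fields record a death for one row."""
    cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]
    counts = 0
    for i in cols:
        # "NO" and missing values (NaN) mean this death did not happen
        if series[i] == "NO" or pd.isnull(series[i]):
            continue
        counts += 1
    return counts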

In [None]:
avengers["Deaths"] = avengers.apply(deaths, axis = 1)
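
As a possible alternative to the row-wise `apply`, the count can be sketched with a vectorized comparison. This is an illustration on made-up rows (the `toy` frame and `death_cols` name are assumptions), and it counts exact "YES" values, which matches the loop above only if the fields contain nothing but YES, NO, and NaN.

```python
import numpy as np
import pandas as pd

death_cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# Hypothetical toy rows using the YES / NO / NaN values described above
toy = pd.DataFrame(
    [
        ["YES", "YES", np.nan, np.nan, np.nan],
        ["NO", np.nan, np.nan, np.nan, np.nan],
        ["YES", np.nan, np.nan, np.nan, np.nan],
    ],
    columns=death_cols,
)

# Comparing to "YES" yields a boolean frame ("NO" and NaN become False);
# summing across columns counts the True values in each row
toy["Deaths"] = (toy[death_cols] == "YES").sum(axis=1)
print(toy["Deaths"].tolist())
```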

In [None]:
# Spot-check the counting logic by hand on a single row (position 100)
row = avengers.iloc[100]
cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]
counts = 0
for i in cols:
    if row[i] == "NO" or pd.isnull(row[i]):
        continue
    counts += 1
counts

In [None]:
avengers.iloc[100]

In [None]:
pd.options.display.max_columns = None
avengers.head()

# Verifying Years Since Joining

For the final task, we want to know if the Years since joining field accurately reflects the Year column. If an Avenger was introduced in Year 1960, is the Years since joining value for that Avenger 55?

Instructions

Calculate the number of rows where Years since joining is accurate. This challenge was created in 2015, so use that as the reference year. We want to know for how many rows Years since joining was correctly calculated as Year value subtracted from 2015.

In [None]:
avengers['Years since joining'].values

In [None]:
count = 0
for i, row in avengers[['Years since joining', 'Year']].iterrows():
    # Only compare rows where both values are present
    if pd.notnull(row['Year']) and pd.notnull(row['Years since joining']):
        years_joined = 2015 - int(row['Year'])
        if years_joined == row['Years since joining']:
            count += 1
count

In [None]:
# Vectorized version of the same check: keep rows where the field equals 2015 - Year
correct_joined_years = avengers[avengers['Years since joining'] == (2015 - avengers['Year'])]
joined_accuracy_count = len(correct_joined_years)
joined_accuracy_count
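
To make the comparison concrete, here is a small sketch on made-up rows (the `toy` frame, `REFERENCE_YEAR`, and the values are assumptions for illustration, not data from avengers.csv). It shows how the element-wise check treats a wrong value and a missing Year.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2015  # reference year stated in the instructions above

# Hypothetical toy rows, not values from avengers.csv
toy = pd.DataFrame({
    "Year": [1960.0, 1963.0, 2010.0, np.nan],
    "Years since joining": [55.0, 50.0, 5.0, 10.0],
})

expected = REFERENCE_YEAR - toy["Year"]

# Element-wise comparison; NaN never equals anything, so a row with a
# missing Year is counted as inaccurate rather than raising an error
is_accurate = toy["Years since joining"] == expected
accurate_rows = int(is_accurate.sum())
print(accurate_rows)
```

Summing the boolean Series gives the count directly, without building a filtered DataFrame first.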
