# Star Wars movie analysis
- This project focuses on data cleaning, so that we have a dataset ready for further analysis.
- The data comes from FiveThirtyEight's Star Wars survey.
- The data is in a CSV file; let's explore and clean it.

## Data Cleaning

In [1]:
# Read csv
import pandas as pd
star_wars = pd.read_csv('StarWars.csv', encoding='ISO-8859-1')  # file not encoded in UTF-8, 
                                                                # which is the default encoding for pd.read_csv() in pandas
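The encoding note above can be checked on a small in-memory sample. This is only a sketch: the two-line CSV below is made up for illustration (it is not the survey file), and it assumes the file contains a byte such as `0xE6` (`æ` in ISO-8859-1) that is not valid UTF-8 in that position.

```python
import io

import pandas as pd

# Hypothetical two-line CSV, encoded as ISO-8859-1 (not the real survey file).
raw = "question,answer\nFan of the Expanded Universe?\u00e6,Yes\n".encode("ISO-8859-1")

# Reading with the default UTF-8 encoding fails on the 0xE6 byte.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding reads the same bytes without error.
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_ok, df.iloc[0, 0])
```

The stray `æ` at the end of one survey column name is consistent with this: it is a character in the file itself, decoded as ISO-8859-1.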

In [2]:
star_wars.head(4)

Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,,Response,Response,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,Star Wars: Episode I The Phantom Menace,...,Yoda,Response,Response,Response,Response,Response,Response,Response,Response,Response
1,3292880000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
2,3292880000.0,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
3,3292765000.0,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central


In [3]:
star_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 38 columns):
 #   Column                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                         --------------  -----  
 0   RespondentID                                                                                                                                   1186 non-null   float64
 1   Have you seen any of the 6 films in the Star Wars franchise?                                                                                   1187 non-null   object 
 2   Do you consider yourself to be a fan of the Star Wars film franchise?                                                                          837 non-null    object 
 3   Which of the following Star 

In [4]:
star_wars.reset_index().head(3)

Unnamed: 0,index,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,0,,Response,Response,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Yoda,Response,Response,Response,Response,Response,Response,Response,Response,Response
1,1,3292880000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
2,2,3292880000.0,No,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central


- We can see that row 0 holds sub-headers for the column names rather than a response; respondent data actually starts at row 1.
- We will use these sub-headers to rename the unnamed columns, and drop row 0.
- Then we reset the index so that row 0 is the first respondent's response.
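The steps in this list can be sketched on a toy frame. The column names and values below are hypothetical stand-ins for the survey layout. Note that this sketch uses `reset_index(drop=True)`, which does not keep the old index as a column; the notebook's own cells keep that extra `index` column, and their positional slices depend on it.

```python
import pandas as pd

# Toy frame mimicking the survey layout: row 0 holds sub-headers, not a response.
toy = pd.DataFrame({
    "Which films have you seen?": ["Episode I", "Episode I", None],
    "Unnamed: 1": ["Episode II", None, "Episode II"],
})

sub_headers = toy.loc[0]                              # keep the sub-header row for renaming
toy = toy.drop(0).reset_index(drop=True)              # first respondent becomes row 0
toy = toy.rename(columns=sub_headers.to_dict())       # old column name -> sub-header

print(toy.columns.tolist())
print(toy.index.tolist())
```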

In [5]:
# Access 0th row 
row_zero = star_wars.loc[0]
row_zero.reset_index()

Unnamed: 0,index,0
0,RespondentID,
1,Have you seen any of the 6 films in the Star W...,Response
2,Do you consider yourself to be a fan of the St...,Response
3,Which of the following Star Wars films have yo...,Star Wars: Episode I The Phantom Menace
4,Unnamed: 4,Star Wars: Episode II Attack of the Clones
5,Unnamed: 5,Star Wars: Episode III Revenge of the Sith
6,Unnamed: 6,Star Wars: Episode IV A New Hope
7,Unnamed: 7,Star Wars: Episode V The Empire Strikes Back
8,Unnamed: 8,Star Wars: Episode VI Return of the Jedi
9,Please rank the Star Wars films in order of pr...,Star Wars: Episode I The Phantom Menace


In [6]:
# Drop 0th row  with subtitles
star_wars = star_wars.drop(0)

In [7]:
# Reset index 
star_wars = star_wars.reset_index()
star_wars.head()

Unnamed: 0,index,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,1,3292880000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,2,3292880000.0,No,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3,3292765000.0,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
3,4,3292763000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
4,5,3292731000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central


In [8]:
# Rename columns for seen movies for clarity
star_wars = star_wars.rename(columns={"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
                                      'Unnamed: 4':"seen_2", 'Unnamed: 5':"seen_3", 'Unnamed: 6':"seen_4", 'Unnamed: 7':"seen_5", 'Unnamed: 8':"seen_6",})

# Rename columns for movie ranking for clarity
star_wars = star_wars.rename(columns={'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':"ranking_1",
       'Unnamed: 10':"ranking_2", 'Unnamed: 11':"ranking_3", 'Unnamed: 12':"ranking_4", 'Unnamed: 13':"ranking_5",'Unnamed: 14':"ranking_6"})
                 
star_wars.columns  # renamed columns

Index(['index', 'RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1',
       'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?æ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Reg

In [9]:
# Replace the column names at positions 16 to 29 with their sub-header names (the character names)
for col in star_wars.iloc[:, 16:30]:
    # print(col)
    star_wars = star_wars.rename(columns = {col:row_zero[col]})

In [10]:
star_wars.columns  # All renamed columns

Index(['index', 'RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1',
       'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
       'Han Solo', 'Luke Skywalker', 'Princess Leia Organa',
       'Anakin Skywalker', 'Obi Wan Kenobi', 'Emperor Palpatine',
       'Darth Vader', 'Lando Calrissian', 'Boba Fett', 'C-3P0', 'R2 D2',
       'Jar Jar Binks', 'Padme Amidala', 'Yoda', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?æ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

In [11]:
star_wars.head(2)

Unnamed: 0,index,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,1,3292880000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,2,3292880000.0,No,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central


In [12]:
# Map Yes/No in the two yes/no question columns (positions 2 and 3) to True/False

yes_no = {"Yes": True, "No": False}

for col in ['Have you seen any of the 6 films in the Star Wars franchise?',
            'Do you consider yourself to be a fan of the Star Wars film franchise?']:
        star_wars[col] = star_wars[col].map(yes_no)

star_wars.head(3)

Unnamed: 0,index,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,1,3292880000.0,True,True,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,2,3292880000.0,False,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3,3292765000.0,True,False,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central


In [13]:
# Change RespondentID to a 64-bit int for clarity (the IDs exceed the 32-bit integer range)
star_wars['RespondentID'] = star_wars['RespondentID'].astype('int64')

In [14]:
star_wars.head(3)

Unnamed: 0,index,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,1,3292879998,True,True,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,2,3292879538,False,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3,3292765271,True,False,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central


In [15]:
# Replace seen movie names with true/false from seen_1 to seen_6 columns
# Pay attention to correct string names

import numpy as np 

movie_mapping = {
                 "Star Wars: Episode I  The Phantom Menace": True,
                 "Star Wars: Episode II  Attack of the Clones":True,	
                 "Star Wars: Episode III  Revenge of the Sith": True,
                 "Star Wars: Episode IV  A New Hope":True, 
                 "Star Wars: Episode V The Empire Strikes Back":True,	
                 "Star Wars: Episode VI Return of the Jedi": True,
                 np.nan : False
                }
for col in star_wars.iloc[:,4:10]:
    # print(col)
    star_wars[col] = star_wars[col].map(movie_mapping)

In [16]:
star_wars.head(3)

Unnamed: 0,index,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,1,3292879998,True,True,True,True,True,True,True,True,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,2,3292879538,False,,False,False,False,False,False,False,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3,3292765271,True,False,True,True,True,False,False,False,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central


In [17]:
# Get index of all columns
for col_index, col in enumerate(star_wars.columns):
    print(col_index, col)

0 index
1 RespondentID
2 Have you seen any of the 6 films in the Star Wars franchise?
3 Do you consider yourself to be a fan of the Star Wars film franchise?
4 seen_1
5 seen_2
6 seen_3
7 seen_4
8 seen_5
9 seen_6
10 ranking_1
11 ranking_2
12 ranking_3
13 ranking_4
14 ranking_5
15 ranking_6
16 Han Solo
17 Luke Skywalker
18 Princess Leia Organa
19 Anakin Skywalker
20 Obi Wan Kenobi
21 Emperor Palpatine
22 Darth Vader
23 Lando Calrissian
24 Boba Fett
25 C-3P0
26 R2 D2
27 Jar Jar Binks
28 Padme Amidala
29 Yoda
30 Which character shot first?
31 Are you familiar with the Expanded Universe?
32 Do you consider yourself to be a fan of the Expanded Universe?æ
33 Do you consider yourself to be a fan of the Star Trek franchise?
34 Gender
35 Age
36 Household Income
37 Education
38 Location (Census Region)


In [18]:
# Make ranking columns float to find average values 

star_wars[star_wars.columns[10:16]] = star_wars[star_wars.columns[10:16]].astype(float)


In [19]:
star_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 39 columns):
 #   Column                                                                 Non-Null Count  Dtype  
---  ------                                                                 --------------  -----  
 0   index                                                                  1186 non-null   int64  
 1   RespondentID                                                           1186 non-null   int64  
 2   Have you seen any of the 6 films in the Star Wars franchise?           1186 non-null   bool   
 3   Do you consider yourself to be a fan of the Star Wars film franchise?  836 non-null    object 
 4   seen_1                                                                 1186 non-null   bool   
 5   seen_2                                                                 1186 non-null   bool   
 6   seen_3                                                                 1186 non-null   b

## Data Analysis
After cleaning the data, we can now use it for further analysis.

- We will analyse: 
1. Average ranking of movies.
2. Total view counts of movies.
3. Gender wise average ranking of movies.
4. Gender wise view counts of movies.
5. Most popular characters.

### 1. Average ranking of Star Wars episodes

In [20]:
starwars_ranking_cols = star_wars.iloc[:, 10:16]
starwars_ranking_cols = starwars_ranking_cols.dropna()


In [21]:
# Change column names of ranking
new_names = {
    'ranking_1': 'Episode I',
    'ranking_2': 'Episode II',
    'ranking_3': 'Episode III',
    'ranking_4': 'Episode IV',
    'ranking_5': 'Episode V',
    'ranking_6': 'Episode VI',
    'seen_1': 'Episode I',
    'seen_2': 'Episode II',
    'seen_3': 'Episode III',
    'seen_4': 'Episode IV',
    'seen_5': 'Episode V',
    'seen_6': 'Episode VI'
}
starwars_ranking_cols = starwars_ranking_cols.rename(columns=new_names)

mean_rankings = starwars_ranking_cols.mean()

# mean_rankings = mean_rankings.rename(new_names)

In [22]:
(starwars_ranking_cols.values>6).sum()

np.int64(0)

In [23]:
mean_rankings

Episode I      3.736211
Episode II     4.091127
Episode III    4.344125
Episode IV     3.269784
Episode V      2.511990
Episode VI     3.046763
dtype: float64

In [96]:
import plotly.graph_objects as go

fig = go.Figure()

for col in starwars_ranking_cols.columns:
    fig.add_trace(go.Violin(
        y=starwars_ranking_cols[col],
        name= col,  # Name of episode
        box_visible = True,  # To show box plot inside violin
        meanline_visible=True, # To show mean line
        bandwidth=0.4
))
    # Add the mean values as a separate trace (optional, to highlight the mean rankings)
fig.add_trace(go.Scatter(
    x=mean_rankings.index,
    y=mean_rankings.values,
    mode='markers',
    name='Mean Rankings',
    marker=dict(color='black', size=7, symbol='x')
))

fig.update_layout(
    title = 'Distribution of Movie Rankings',
    xaxis_title= 'Movie Episode',
    yaxis_title = 'Ranking',
    violingap=0.1,
    width= 1000
)
fig.show()


In this survey a rank of 1 marks a respondent's favorite film and 6 the least favorite, so a lower mean ranking means a better-liked film. Episodes II and III have mean rankings above 4 (4.09 and 4.34), which makes them the least preferred films overall. Episode I sits in the middle with a mean ranking of about 3.7 and a median of 4: it is not the favorite of most respondents, but it is not the least liked either. Episodes V and VI have the best mean rankings (2.51 and 3.05), with Episode V the clear favorite, and Episode IV follows at 3.27.

The wider interquartile range (IQR) of Episodes III and IV indicates more variability in the rankings given by respondents than for the other episodes. By the same measure, Episodes I and II received more consistent rankings, with means near 3.7 and 4.1 respectively.

Overall, Episodes V and VI are liked the most and Episodes II and III the least. Opinions on Episodes III and IV are more varied, while Episodes I and II are ranked more consistently.
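The quantities read off the violin plot (mean, median, IQR) can also be computed directly. The sketch below uses made-up rankings from four hypothetical respondents, not the survey data, and picks the best-liked film as the one with the lowest mean rank.

```python
import pandas as pd

# Hypothetical rankings from four respondents (1 = favorite, 6 = least favorite).
toy_rankings = pd.DataFrame({
    "Episode I": [3.0, 4.0, 5.0, 4.0],
    "Episode V": [1.0, 2.0, 1.0, 2.0],
})

# One row per episode: mean, median, and interquartile range of the ranks.
summary = pd.DataFrame({
    "mean": toy_rankings.mean(),
    "median": toy_rankings.median(),
    "iqr": toy_rankings.quantile(0.75) - toy_rankings.quantile(0.25),
})

# A lower mean rank means the film is preferred.
best = summary["mean"].idxmin()
print(summary)
print(best)
```

The same three-column summary could be built from `starwars_ranking_cols` to back the reading of the plot with numbers.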

### 2. Total view counts of Star Wars movies

In [25]:
star_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 39 columns):
 #   Column                                                                 Non-Null Count  Dtype  
---  ------                                                                 --------------  -----  
 0   index                                                                  1186 non-null   int64  
 1   RespondentID                                                           1186 non-null   int64  
 2   Have you seen any of the 6 films in the Star Wars franchise?           1186 non-null   bool   
 3   Do you consider yourself to be a fan of the Star Wars film franchise?  836 non-null    object 
 4   seen_1                                                                 1186 non-null   bool   
 5   seen_2                                                                 1186 non-null   bool   
 6   seen_3                                                                 1186 non-null   b

In [26]:
# Check the values of the seen columns (not the column labels) for NaN
star_wars[star_wars.columns[4:10]].isnull().any().to_numpy()

array([False, False, False, False, False, False])

- The seen columns have no NaN values, because we already mapped NaN values in these columns to False, i.e. not seen.
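As a cross-check on the seen columns, the sketch below uses a different technique from the notebook's dictionary mapping: it treats any non-missing answer as "seen" via `DataFrame.notna()`. That is an assumption that holds only if the columns contain nothing but a film title or NaN. The toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical answers: a film title if seen, NaN otherwise.
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I  The Phantom Menace", np.nan,
               "Star Wars: Episode I  The Phantom Menace"],
    "seen_2": [np.nan, np.nan, "Star Wars: Episode II  Attack of the Clones"],
})

seen = toy.notna()                      # True where a title was given, False for NaN
has_missing = seen.isnull().any()       # per-column check on the values
view_counts = seen.sum()                # True counts as 1

print(has_missing.tolist())
print(view_counts.to_dict())
```

Because this approach does not depend on the exact title strings, it sidesteps the spacing issue flagged in the mapping cell.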

In [27]:
# View Counts

counts = star_wars[star_wars.columns[4:10]].sum()
# counts = counts.rename(new_names)
counts

seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64

In [28]:
counts = counts.rename(new_names)  # a Series has no columns; pass the mapping directly to rename its index

In [97]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x=counts.index,
    y= counts.values, 
    width= 0.5, 
    name='Total View Counts of Different Episodes',
    text=list(counts.values),
    textposition='inside'
))
fig.update_layout(
     xaxis_title='Movie Episode', 
     yaxis_title = 'View Counts',
     width = 1000, 
     title='Total View Counts of Different Episodes'
     
)

Episodes V and VI have the highest view counts (758 and 738), followed by Episode I (673). Episode III has the lowest (550).

### 3. Gender wise Average Rankings

In [29]:
# Average rankings given by genders

male_rows = star_wars[star_wars["Gender"] == "Male"]
female_rows = star_wars[star_wars["Gender"]=="Female"]

In [30]:
male_rows.columns[10:16]

Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')

In [31]:
male_ranking_cols = male_rows[male_rows.columns[10:16]]
male_ranking_cols = male_ranking_cols.dropna()

In [32]:
female_ranking_cols = female_rows[female_rows.columns[10:16]]
female_ranking_cols = female_ranking_cols.dropna()

In [33]:
new_names
male_ranking_cols = male_ranking_cols.rename(columns=new_names)
female_ranking_cols = female_ranking_cols.rename(columns=new_names)

In [34]:
female_ranking_cols

Unnamed: 0,Episode I,Episode II,Episode III,Episode IV,Episode V,Episode VI
111,1.0,4.0,5.0,6.0,2.0,3.0
112,1.0,2.0,3.0,4.0,5.0,6.0
114,3.0,4.0,5.0,6.0,1.0,2.0
116,3.0,4.0,5.0,6.0,2.0,1.0
117,3.0,6.0,4.0,1.0,2.0,5.0
...,...,...,...,...,...,...
1180,3.0,4.0,5.0,2.0,1.0,6.0
1181,5.0,4.0,6.0,3.0,2.0,1.0
1182,4.0,5.0,6.0,2.0,3.0,1.0
1184,4.0,3.0,6.0,5.0,2.0,1.0


In [35]:
male_ranking_cols

Unnamed: 0,Episode I,Episode II,Episode III,Episode IV,Episode V,Episode VI
0,3.0,2.0,1.0,4.0,5.0,6.0
2,1.0,2.0,3.0,4.0,5.0,6.0
3,5.0,6.0,1.0,2.0,4.0,3.0
4,5.0,4.0,6.0,2.0,1.0,3.0
5,1.0,4.0,3.0,6.0,5.0,2.0
...,...,...,...,...,...,...
1146,4.0,5.0,3.0,6.0,1.0,2.0
1148,5.0,3.0,4.0,6.0,1.0,2.0
1164,5.0,4.0,3.0,6.0,1.0,2.0
1169,4.0,5.0,6.0,3.0,1.0,2.0


In [36]:
male_ranking_mean = male_ranking_cols.mean()
female_ranking_mean = female_ranking_cols.mean()

In [37]:
male_ranking_mean

Episode I      4.045024
Episode II     4.229858
Episode III    4.274882
Episode IV     2.995261
Episode V      2.452607
Episode VI     3.002370
dtype: float64

In [38]:
female_ranking_mean

Episode I      3.429293
Episode II     3.957071
Episode III    4.424242
Episode IV     3.540404
Episode V      2.573232
Episode VI     3.075758
dtype: float64

In [101]:
import plotly.graph_objects as go

fig = go.Figure()

for col in male_ranking_cols.columns:
    fig.add_trace(go.Violin(
        y=male_ranking_cols[col],
        name= col,  # Name of episode
        box_visible = True,  # To show box plot inside violin
        meanline_visible=True, # To show mean line
        bandwidth=0.4
))
    # Add the mean values as a separate trace (optional, to highlight the mean rankings)
fig.add_trace(go.Scatter(
    x=male_ranking_mean.index,
    y=male_ranking_mean.values,
    mode='markers',
    name='Mean Rankings',
    marker=dict(color='black', size=7, symbol='x')
))

fig.update_layout(
    title = 'Genderwise Distribution of Movie Rankings - Male',
    xaxis_title= 'Movie Episode',
    yaxis_title = 'Ranking',
    violingap=0.1,
    width= 1000
)
fig.show()


fig = go.Figure()

for col in female_ranking_cols.columns:
    fig.add_trace(go.Violin(
        y = female_ranking_cols[col],
        name=col,
        box_visible=True,
        meanline_visible=True,
    ))

fig.add_trace(go.Scatter(
    x=female_ranking_mean.index,
    y= female_ranking_mean.values,
    mode='markers',
    marker=dict(color='black', symbol='x'), 
    name='Mean'
    
))

fig.update_layout(width=1000,
    xaxis_title='Movie Episode',
    yaxis_title='Ranking',
    violingap=0.1,
    title= 'Genderwise Distribution of Movie Rankings - Female'
)

fig.show()


- The rankings for Episodes I and IV differ noticeably by gender. Since 1 is the best rank, female respondents ranked Episode I more favorably (mean 3.43) than male respondents (mean 4.05). Episode IV drew mixed reactions from female respondents, with a wide IQR and a mean of 3.54, while male respondents ranked it more favorably and more consistently (mean 3.00). For the other episodes, the ranking distributions appear similar across genders.

### 4. Gender wise View Counts

In [47]:
# View counts by Gender
male_view_counts = male_rows[male_rows.columns[4:10]].sum()
male_view_counts = male_view_counts.rename(new_names)   # Renames index
male_view_counts

Episode I      361
Episode II     323
Episode III    317
Episode IV     342
Episode V      392
Episode VI     387
dtype: int64

In [46]:
female_view_counts = female_rows[female_rows.columns[4:10]].sum()
female_view_counts = female_view_counts.rename(new_names)
female_view_counts

Episode I      298
Episode II     237
Episode III    222
Episode IV     255
Episode V      353
Episode VI     338
dtype: int64

In [102]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x = male_view_counts.index,
    y = male_view_counts.values,
    name = 'Male',
    width= 0.35,
    text = male_view_counts.values
))

fig.add_trace(go.Bar(
    x = female_view_counts.index,
    y = female_view_counts.values,
    name='Female',
    width = 0.35, 
    text= female_view_counts.values
))
fig.update_layout(
    barmode= 'group', 
    width = 1000,
    title = 'View Counts of Movies by Gender',
    xaxis_title= 'Movie Episode',
    yaxis_title= 'View Counts'

    )

For both male and female respondents, Episodes V and VI have the highest view counts and Episodes II and III the lowest. Episodes V and VI also have the best (lowest) mean rankings.
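The gender-wise means and view counts can also be produced in one step with `groupby`, instead of filtering male and female rows separately. This is a sketch on a hypothetical five-row frame with one ranking column and one seen column. Rows with a missing `Gender` are dropped by `groupby` by default, just as the `== "Male"` and `== "Female"` filters exclude them.

```python
import pandas as pd

# Hypothetical cleaned rows: one ranking column and one boolean seen column.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", None],
    "ranking_1": [4.0, 3.0, 5.0, 2.0, 1.0],
    "seen_1": [True, True, False, True, True],
})

# Named aggregation: mean rank and number of viewers per gender.
by_gender = toy.groupby("Gender").agg(
    mean_ranking_1=("ranking_1", "mean"),
    views_1=("seen_1", "sum"),
)
print(by_gender)
```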

### 5. Most popular character

In [99]:
characters = star_wars[star_wars.columns[16:30]].copy()  # copy, so later edits do not touch star_wars
characters.head()

Unnamed: 0,Han Solo,Luke Skywalker,Princess Leia Organa,Anakin Skywalker,Obi Wan Kenobi,Emperor Palpatine,Darth Vader,Lando Calrissian,Boba Fett,C-3P0,R2 D2,Jar Jar Binks,Padme Amidala,Yoda
0,Very favorably,Very favorably,Very favorably,Very favorably,Very favorably,Very favorably,Very favorably,Unfamiliar (N/A),Unfamiliar (N/A),Very favorably,Very favorably,Very favorably,Very favorably,Very favorably
1,,,,,,,,,,,,,,
2,Somewhat favorably,Somewhat favorably,Somewhat favorably,Somewhat favorably,Somewhat favorably,Unfamiliar (N/A),Unfamiliar (N/A),Unfamiliar (N/A),Unfamiliar (N/A),Unfamiliar (N/A),Unfamiliar (N/A),Unfamiliar (N/A),Unfamiliar (N/A),Unfamiliar (N/A)
3,Very favorably,Very favorably,Very favorably,Very favorably,Very favorably,Somewhat favorably,Very favorably,Somewhat favorably,Somewhat unfavorably,Very favorably,Very favorably,Very favorably,Very favorably,Very favorably
4,Very favorably,Somewhat favorably,Somewhat favorably,Somewhat unfavorably,Very favorably,Very unfavorably,Somewhat favorably,Neither favorably nor unfavorably (neutral),Very favorably,Somewhat favorably,Somewhat favorably,Very unfavorably,Somewhat favorably,Somewhat favorably


In [54]:
set(characters[characters.columns[0]])  # Unique set of entries for popularity

{'Neither favorably nor unfavorably (neutral)',
 'Somewhat favorably',
 'Somewhat unfavorably',
 'Unfamiliar (N/A)',
 'Very favorably',
 'Very unfavorably',
 nan}

In [55]:
len(characters)

1186

In [56]:
characters.isna().sum()

Han Solo                357
Luke Skywalker          355
Princess Leia Organa    355
Anakin Skywalker        363
Obi Wan Kenobi          361
Emperor Palpatine       372
Darth Vader             360
Lando Calrissian        366
Boba Fett               374
C-3P0                   359
R2 D2                   356
Jar Jar Binks           365
Padme Amidala           372
Yoda                    360
dtype: int64

- A NaN response in these columns could mean:
1. The respondent neither likes nor dislikes any character, i.e. is neutral towards all of them.
2. The respondent is not interested in individual characters, which again amounts to being neutral towards every character.

- Since we cannot tell which case applies, we will treat NaN entries as a separate segment: 'review unavailable'.
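The grouping described here can be sketched as follows. The toy frame is hypothetical, and the handling of NaN is an alternative formulation: the answers are mapped first, and the remaining missing values are filled afterwards with `fillna`, rather than putting `np.nan` in the mapping as a key.

```python
import numpy as np
import pandas as pd

# Hypothetical answers for two characters from three respondents.
toy = pd.DataFrame({
    "Han Solo": ["Very favorably", np.nan, "Unfamiliar (N/A)"],
    "Jar Jar Binks": ["Very unfavorably", np.nan, "Somewhat favorably"],
})

groups = {
    "Very favorably": "favorable",
    "Somewhat favorably": "favorable",
    "Neither favorably nor unfavorably (neutral)": "neutral",
    "Unfamiliar (N/A)": "neutral",
    "Somewhat unfavorably": "unfavorable",
    "Very unfavorably": "unfavorable",
}

# Map each answer to its group; NaN stays NaN and is labelled afterwards.
grouped = toy.apply(lambda col: col.map(groups)).fillna("review unavailable")

# Count each group per character; groups absent from a column become 0.
group_counts = grouped.apply(pd.Series.value_counts).fillna(0).astype(int)
print(group_counts)
```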

In [57]:
# map categories to main three categories
popularity = {'Neither favorably nor unfavorably (neutral)':'neutral',
 'Somewhat favorably': 'favorable',
 'Somewhat unfavorably': 'unfavorable',
 'Unfamiliar (N/A)': 'neutral',
 'Very favorably': 'favorable',
 'Very unfavorably': 'unfavorable',
 np.nan: 'review unavailable'}

In [58]:
for col in characters.columns:
    characters.loc[:, col]= characters[col].map(popularity)

In [59]:
characters.isna().sum()

Han Solo                0
Luke Skywalker          0
Princess Leia Organa    0
Anakin Skywalker        0
Obi Wan Kenobi          0
Emperor Palpatine       0
Darth Vader             0
Lando Calrissian        0
Boba Fett               0
C-3P0                   0
R2 D2                   0
Jar Jar Binks           0
Padme Amidala           0
Yoda                    0
dtype: int64

In [61]:
characters.head()

Unnamed: 0,Han Solo,Luke Skywalker,Princess Leia Organa,Anakin Skywalker,Obi Wan Kenobi,Emperor Palpatine,Darth Vader,Lando Calrissian,Boba Fett,C-3P0,R2 D2,Jar Jar Binks,Padme Amidala,Yoda
0,favorable,favorable,favorable,favorable,favorable,favorable,favorable,neutral,neutral,favorable,favorable,favorable,favorable,favorable
1,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable,review unavailable
2,favorable,favorable,favorable,favorable,favorable,neutral,neutral,neutral,neutral,neutral,neutral,neutral,neutral,neutral
3,favorable,favorable,favorable,favorable,favorable,favorable,favorable,favorable,unfavorable,favorable,favorable,favorable,favorable,favorable
4,favorable,favorable,favorable,unfavorable,favorable,unfavorable,favorable,neutral,favorable,favorable,favorable,unfavorable,favorable,favorable


In [62]:
popularity_counts = characters.apply(pd.Series.value_counts) # This counts the unique values in each column

In [63]:
popularity_counts

Unnamed: 0,Han Solo,Luke Skywalker,Princess Leia Organa,Anakin Skywalker,Obi Wan Kenobi,Emperor Palpatine,Darth Vader,Lando Calrissian,Boba Fett,C-3P0,R2 D2,Jar Jar Binks,Padme Amidala,Yoda
favorable,761,771,757,514,750,253,481,365,291,703,747,242,351,749
neutral,59,44,56,187,60,369,94,384,380,94,67,273,371,61
review unavailable,357,355,355,363,361,372,360,366,374,359,356,365,372,360
unfavorable,9,16,18,122,15,192,251,71,141,30,16,306,92,16


In [65]:
categories = popularity_counts.index
categories


Index(['favorable', 'neutral', 'review unavailable', 'unfavorable'], dtype='object')

In [103]:
fig = go.Figure()

for category in popularity_counts.index:
    fig.add_trace(go.Bar(
        y = popularity_counts.columns,
        x = popularity_counts.loc[category], 
        name = category, 
        orientation='h'
    ))

# Update layout for stacked bars
fig.update_layout(
    barmode='stack',
    title='Review of Star Wars Characters Based on Familiarity and Public Sentiment',
    xaxis_title='Count',
    yaxis_title='Character',
    width=1000,
    height=600
)
fig.show()

Han Solo, Luke Skywalker, Princess Leia Organa, Obi Wan Kenobi, R2 D2, and Yoda are viewed most favorably, each with about 750 to 770 favorable responses. More divisive characters such as Jar Jar Binks and Emperor Palpatine receive mixed reactions. The absence of reviews for some characters may reflect generational gaps, varying levels of familiarity, or the complexity of the characters themselves.
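To put numbers on "most favorable" and "divisive", one option is to compare favorable and unfavorable counts among respondents who gave an answer. The sketch below copies three columns from the `popularity_counts` table above; the two measures (`share_favorable` and `net_favorable`) are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Counts copied from the popularity_counts output for three characters.
subset = pd.DataFrame(
    {
        "Han Solo": [761, 59, 9],
        "Jar Jar Binks": [242, 273, 306],
        "Emperor Palpatine": [253, 369, 192],
    },
    index=["favorable", "neutral", "unfavorable"],
)

answered = subset.sum()                                   # excludes 'review unavailable'
share_favorable = subset.loc["favorable"] / answered      # favorable share of answers
net_favorable = subset.loc["favorable"] - subset.loc["unfavorable"]

print(share_favorable.round(3))
print(net_favorable)
```

A negative `net_favorable` value, as for Jar Jar Binks here, means more unfavorable than favorable responses.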