<h1>Data Cleaning and Analysis with Python</h1>
<h3>Overview</h3>
<ul>
    <li>Entering your data</li>
    <li>Reading datasets and cleaning data</li>
    <li>Analyzing and visualizing data</li>
       </ul>


<h3>Resources</h3>
<ul>
    <li>O'Reilly Learning Platform: <a href="https://databases.lib.wvu.edu/connect/1540334373" target="_blank">https://databases.lib.wvu.edu/connect/1540334373</a></li>
</ul>

<h2>Data science libraries in Python</h2>
<p>Listed below are the major libraries that provide the functions, methods, and constants most often used in data science. Each library has a documentation website, much like the Python Standard Library, that serves as both a reference and a source of tutorials.</p>
<h3>Storage, Manipulations, Calculations</h3>
<ul>
    <li><a href ="https://numpy.org/">NumPy</a></li>
    <li><a href ="https://pandas.pydata.org/">Pandas</a></li>
    <li><a href="https://www.scipy.org/scipylib/index.html">SciPy</a></li>
    <li><a href="https://www.statsmodels.org/stable/index.html">StatsModels</a></li>
    </ul>
<h3>Visualization</h3>
 <ul>
    <li><a href="https://matplotlib.org/">Matplotlib</a></li>
    <li><a href ="https://bokeh.org/">Bokeh</a></li>
    </ul>
<h3>Machine Learning</h3>
  <ul>
    <li><a href ="https://scikit-learn.org/stable/">scikit-learn</a></li>
    <li><a href="https://www.tensorflow.org/">TensorFlow</a></li>
    <li><a href = "https://keras.io/">Keras</a></li>
    </ul>

<h2>Pandas</h2>

In [1]:
# import a library

import pandas as pd


# Alias ->
# "as" gives the library a shorter alias (here, pd). This simplifies your code
# and reduces the amount of typing that you need to do.

<h3>Series in Pandas</h3>
<p>A Series is a one-dimensional labeled array. Each value is paired with an index label, so it can be used much like a set of key-value pairs.</p>
<ul><li><a href ="https://pandas.pydata.org/docs/reference/api/pandas.Series.html">Pandas Documentation - Series</a></li></ul>

In [2]:
series_example = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name = "Example")
series_example

a    1
b    2
c    3
d    4
Name: Example, dtype: int64

In [3]:
series_example['a']

np.int64(1)
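
As a small sketch of the key-value behavior described above, a Series can also be built from a dictionary and then queried by label or by position. The name `heights` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index labels
heights = pd.Series({"Luke": 172, "Leia": 150, "Yoda": 66}, name="height_cm")

by_label = heights.loc["Leia"]    # label-based lookup
by_position = heights.iloc[0]     # position-based lookup
tall = heights[heights > 100]     # boolean filtering keeps the labels

print(by_label, by_position)
print(tall)
```

Using `.loc` and `.iloc` makes it explicit whether a lookup is by label or by position, which helps when the index labels are themselves integers.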

<h2>Reading Data</h2>

<h3> Read Options</h3>
<ul><li>read_csv -- Load delimited data from a file, URL, or file-like object; use comma as default delimiter</li>
<li>read_fwf -- Read data in fixed-width column format (i.e., no delimiters)</li>
<li>read_excel -- Read tabular data from an Excel XLS or XLSX file</li>
<li>read_html -- Read all tables found in the given HTML document</li>
<li>read_json -- Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object</li>
<li>read_sas -- Read a SAS dataset stored in one of the SAS system’s custom storage formats</li>
<li>read_spss -- Read a data file created by SPSS</li>
<li>read_stata -- Read a dataset from Stata file format</li>
<li>read_xml -- Read a table of data from an XML file</li></ul>

In [4]:
# read a csv file

sw_df=pd.read_csv('starwars.csv')
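
The read options above note that `read_csv` accepts file-like objects and other delimiters. The sketch below illustrates both, assuming a small semicolon-delimited text held in memory; the text and the name `demo_df` are invented for the example and are not part of the Star Wars file.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk
raw = io.StringIO("name;height;mass\nLuke;172;77\nLeia;150;49\n")

# sep overrides the default comma delimiter
demo_df = pd.read_csv(raw, sep=";")

print(demo_df.shape)
print(demo_df.dtypes)
```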

<h3>Exploring the Dataframe</h3>

In [5]:
# view the first or last rows of a dataframe with .head() and .tail()

sw_df.head()  # shows 5 rows by default

#sw_df.tail(10)  # pass a number to change how many rows are displayed

Unnamed: 0,name,character_id,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,31543,172,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,34621,167,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,38280,96,32.0,,white,red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,16914,202,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,32723,150,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human


In [6]:
# view the data type of each variable in the dataframe

sw_df.dtypes

name             object
character_id      int64
height            int64
mass            float64
hair_color       object
skin_color       object
eye_color        object
birth_year      float64
sex              object
gender           object
homeworld        object
species          object
dtype: object

In [7]:
# look at basic information about the dataframe with info()

sw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          87 non-null     object 
 1   character_id  87 non-null     int64  
 2   height        87 non-null     int64  
 3   mass          80 non-null     float64
 4   hair_color    82 non-null     object 
 5   skin_color    87 non-null     object 
 6   eye_color     87 non-null     object 
 7   birth_year    43 non-null     float64
 8   sex           87 non-null     object 
 9   gender        83 non-null     object 
 10  homeworld     80 non-null     object 
 11  species       83 non-null     object 
dtypes: float64(2), int64(2), object(8)
memory usage: 8.3+ KB


In [8]:
# view the number of observations (rows) and variables (columns)

sw_df.shape

(87, 12)

In [9]:
#viewing the names of variables

sw_df.columns

Index(['name', 'character_id', 'height', 'mass', 'hair_color', 'skin_color',
       'eye_color', 'birth_year', 'sex', 'gender', 'homeworld', 'species'],
      dtype='object')

In [10]:
# get basic descriptive statistics

sw_df.describe()

Unnamed: 0,character_id,height,mass,birth_year
count,87.0,87.0,80.0,43.0
mean,31907.287356,173.402299,90.58,87.565116
std,11368.134613,37.222917,146.111401,154.691439
min,10630.0,35.0,15.0,8.0
25%,22061.0,165.5,57.0,35.0
50%,33496.0,180.0,77.0,52.0
75%,40833.5,191.0,84.0,72.0
max,49618.0,264.0,1358.0,896.0




<table style = "width:60%; margin-left:0; border: 3px solid #f0f0f0;">
  <tr style = "text-align: left;">
    <th>Aggregation</th>
    <th>Returns</th>
  </tr>
  <tr>
    <td>count</td>
    <td>Total number of items</td>
  </tr>    
  <tr>
    <td>first, last</td>
    <td>First and last item</td>
  </tr>   	
  <tr>
    <td>mean, median</td>
    <td>Mean and median</td>
  </tr>  
  <tr>
    <td>min, max</td>
    <td>Minimum and maximum</td>
  </tr>  
  <tr>
    <td>std, var</td>
    <td>Standard deviation and variance</td>
  </tr>  
  <tr>
    <td>sum</td>
    <td>Sum of all items</td>
  </tr> 
</table>
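
The aggregations in the table can be called one at a time on a column or several at once with `agg()`. This is a minimal sketch on a made-up dataframe named `demo`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical data with one missing mass
demo = pd.DataFrame({
    "height": [172, 167, 96, 202],
    "mass": [77.0, 75.0, 32.0, None],
})

# One aggregation on one column; missing values are skipped by default
mean_mass = demo["mass"].mean()

# Several aggregations on every column at once
summary = demo.agg(["count", "min", "max", "median"])

print(mean_mass)
print(summary)
```

Note that `count` reports non-missing values, so it can differ between columns of the same dataframe.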

In [11]:
# view how many times each value occurs in a variable

sw_df["homeworld"].value_counts()

homeworld
Naboo             11
Tatooine          10
Alderaan           3
Kamino             3
Coruscant          3
Kashyyyk           2
Corellia           2
Ryloth             2
Mirial             2
Stewjon            1
Eriadu             1
Trandosha          1
Socorro            1
Bespin             1
Mon Cala           1
Chandrila          1
Rodia              1
Nal Hutta          1
Bestine IV         1
Cato Neimoidia     1
Sullust            1
Endor              1
Malastare          1
Dathomir           1
Vulpter            1
Troiken            1
Toydaria           1
Haruun Kal         1
Cerea              1
Glee Anselm        1
Iridonia           1
Iktotch            1
Quermia            1
Dorin              1
Tund               1
Champala           1
Geonosis           1
Serenno            1
Concord Dawn       1
Zolan              1
Ojom               1
Aleen Minor        1
Skako              1
Muunilinst         1
Shili              1
Kalee              1
Umbara             1
Uta

<h2>Cleaning Data</h2>

<h3>Why would you need to clean data?</h3>
<ul>
    <li>Data in rows and columns are not arranged correctly</li>
    <li>Missing values need to be filled in or excluded</li>
    <li>Units are wrong or inconsistent</li>
    <li>Order of magnitude is off</li>
    <li>Outliers skew the data</li>
    </ul>

<h3>Change Data Types</h3>

In [12]:
# change the data type of a variable

# character_id should be a string
sw_df['character_id'] = sw_df['character_id'].astype("str")

In [13]:
sw_df.dtypes

name             object
character_id     object
height            int64
mass            float64
hair_color       object
skin_color       object
eye_color        object
birth_year      float64
sex              object
gender           object
homeworld        object
species          object
dtype: object
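
Conversion in the other direction, from text to numbers, often fails on stray entries. The sketch below assumes a made-up column containing the placeholder `"n/a"` and uses `pd.to_numeric` with `errors="coerce"` as one possible way to handle it.

```python
import pandas as pd

# Hypothetical data: an integer ID and a height column stored as text
demo = pd.DataFrame({"character_id": [31543, 34621], "height": ["172", "n/a"]})

# Identifiers are labels, not quantities, so store them as text
demo["character_id"] = demo["character_id"].astype(str)

# errors="coerce" turns unparseable entries into NaN instead of raising an error
demo["height"] = pd.to_numeric(demo["height"], errors="coerce")

print(demo.dtypes)
```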

<h3>Filter Variables</h3>
<p>Keep only the observations that meet one or more criteria, expressed with comparison and Boolean operators.</p>

In [14]:
sw_eye = sw_df[sw_df['eye_color']=='blue']

sw_eye

Unnamed: 0,name,character_id,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,31543,172,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
5,Owen Lars,17029,178,120.0,brown,light,blue,52.0,male,masculine,Tatooine,Human
6,Beru Whitesun lars,32889,165,75.0,brown,light,blue,47.0,female,feminine,Tatooine,Human
10,Anakin Skywalker,33979,188,84.0,blond,fair,blue,41.9,male,masculine,Tatooine,Human
11,Wilhuff Tarkin,40831,180,,auburn,fair,blue,64.0,male,masculine,Eriadu,Human
12,Chewbacca,46714,228,112.0,brown,unknown,blue,200.0,male,masculine,Kashyyyk,Wookiee
17,Jek Tono Porkins,45617,180,110.0,brown,fair,blue,,male,masculine,Bestine IV,Human
24,Lobot,17813,175,79.0,none,light,blue,37.0,male,masculine,Bespin,Human
26,Mon Mothma,46247,150,,auburn,fair,blue,48.0,female,feminine,Chandrila,Human
30,Qui-Gon Jinn,13858,193,89.0,brown,fair,blue,92.0,male,masculine,,Human


<h3>Boolean Operators</h3>
<p>Use comparison operators to test each observation, and combine the tests with AND or OR to filter the dataframe.</p>
<ul>
    <li>Equal ( == )</li>
    <li>Not equal ( != )</li>
    <li>Greater than ( > )</li>
    <li>Less than ( < )</li>
    <li>Greater than or equal ( >= )</li>
    <li>Less than or equal ( <= )</li>
    <li>AND ( & )</li>
    <li>OR ( | )</li>
    </ul>
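
The operators above can be sketched on a small made-up dataframe. Each condition is stored in its own variable here; when conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than the comparison operators. The `~` (NOT) operator is an addition to the list above.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({
    "name": ["Luke", "Owen", "Leia", "R2-D2"],
    "eye_color": ["blue", "blue", "brown", "red"],
    "birth_year": [19.0, 52.0, 19.0, 33.0],
})

blue = demo["eye_color"] == "blue"
young = demo["birth_year"] < 50

both = demo[blue & young]      # AND: both conditions hold
either = demo[blue | young]    # OR: at least one condition holds
not_blue = demo[~blue]         # NOT: invert a condition with ~

print(both["name"].tolist())
print(either["name"].tolist())
print(not_blue["name"].tolist())
```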

In [15]:
# filter the dataframe for characters who have blue eyes AND were born after 50 BBY (birth_year < 50)
sw_eye50 = sw_df[(sw_df['eye_color']=='blue') & (sw_df['birth_year']<50)]
sw_eye50

Unnamed: 0,name,character_id,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,31543,172,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
6,Beru Whitesun lars,32889,165,75.0,brown,light,blue,47.0,female,feminine,Tatooine,Human
10,Anakin Skywalker,33979,188,84.0,blond,fair,blue,41.9,male,masculine,Tatooine,Human
24,Lobot,17813,175,79.0,none,light,blue,37.0,male,masculine,Bespin,Human
26,Mon Mothma,46247,150,,auburn,fair,blue,48.0,female,feminine,Chandrila,Human
61,Barriss Offee,37584,166,50.0,black,yellow,blue,40.0,female,feminine,Mirial,Mirialan


In [16]:
# filter the dataframe for characters who have blue eyes OR were born after 50 BBY (birth_year < 50)
sw_eyeor50 = sw_df[(sw_df['eye_color']=='blue') | (sw_df['birth_year']<50)]
sw_eyeor50

Unnamed: 0,name,character_id,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,31543,172,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
2,R2-D2,38280,96,32.0,,white,red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,16914,202,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,32723,150,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
5,Owen Lars,17029,178,120.0,brown,light,blue,52.0,male,masculine,Tatooine,Human
6,Beru Whitesun lars,32889,165,75.0,brown,light,blue,47.0,female,feminine,Tatooine,Human
8,Biggs Darklighter,48171,183,84.0,black,light,brown,24.0,male,masculine,Tatooine,Human
10,Anakin Skywalker,33979,188,84.0,blond,fair,blue,41.9,male,masculine,Tatooine,Human
11,Wilhuff Tarkin,40831,180,,auburn,fair,blue,64.0,male,masculine,Eriadu,Human
12,Chewbacca,46714,228,112.0,brown,unknown,blue,200.0,male,masculine,Kashyyyk,Wookiee


### Select Variables

Select or remove variables from the dataframe

In [17]:
# keep variables
sw_select = sw_df[['name', 'height', 'mass']]

sw_select

Unnamed: 0,name,height,mass
0,Luke Skywalker,172,77.0
1,C-3PO,167,75.0
2,R2-D2,96,32.0
3,Darth Vader,202,136.0
4,Leia Organa,150,49.0
...,...,...,...
82,Rey,164,50.0
83,Poe Dameron,190,70.0
84,BB8,35,
85,Captain Phasma,225,155.0


In [18]:
# remove variables
sw_not_select = sw_df.drop(columns=["height", "mass"])

sw_not_select

Unnamed: 0,name,character_id,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,31543,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,34621,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,38280,,white,red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,16914,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,32723,brown,light,brown,19.0,female,feminine,Alderaan,Human
...,...,...,...,...,...,...,...,...,...,...
82,Rey,49618,brown,light,hazel,,female,feminine,Jakku,Human
83,Poe Dameron,21382,brown,light,brown,,male,masculine,Yavin 4,Human
84,BB8,44128,none,none,black,,none,masculine,,Droid
85,Captain Phasma,38297,unknown,unknown,unknown,,unkown,,Parnassos,


<h3>Assign</h3>
Create new columns or modify existing ones in a dataframe. 

The Body Mass Index (BMI) is calculated as $BMI = \frac{mass\ (kg)}{(height\ (m))^2}$. Heights in this dataset are recorded in centimeters, so they are divided by 100 to convert them to meters.

NOTE: If you assign to a name that already exists in the dataframe, the existing variable is overwritten. If you use a new name, a new variable is created.

In [19]:
#sw_df = sw_df.assign(bmi=sw_df['mass'] / ((sw_df['height']/100) **2))
sw_df['bmi'] = sw_df['mass'] / ((sw_df['height']/100) **2)
sw_df

Unnamed: 0,name,character_id,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species,bmi
0,Luke Skywalker,31543,172,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human,26.027582
1,C-3PO,34621,167,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid,26.892323
2,R2-D2,38280,96,32.0,,white,red,33.0,none,masculine,Naboo,Droid,34.722222
3,Darth Vader,16914,202,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human,33.330066
4,Leia Organa,32723,150,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human,21.777778
...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,49618,164,50.0,brown,light,hazel,,female,feminine,Jakku,Human,18.590125
83,Poe Dameron,21382,190,70.0,brown,light,brown,,male,masculine,Yavin 4,Human,19.390582
84,BB8,44128,35,,none,none,black,,none,masculine,,Droid,
85,Captain Phasma,38297,225,155.0,unknown,unknown,unknown,,unkown,,Parnassos,,30.617284
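
As a worked check of the BMI formula, the first row can be recomputed on its own: 77 kg and 172 cm give 77 / 1.72² ≈ 26.03. The sketch below uses `assign()`, which returns a new dataframe rather than modifying the original; the names `demo` and `with_bmi` are made up for the example.

```python
import pandas as pd

# One row copied from the output above
demo = pd.DataFrame({"name": ["Luke Skywalker"], "height": [172], "mass": [77.0]})

# assign() returns a new dataframe and leaves `demo` unchanged
with_bmi = demo.assign(bmi=demo["mass"] / (demo["height"] / 100) ** 2)

print(with_bmi)
print("bmi" in demo.columns)
```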


In [20]:
sw_df.columns

Index(['name', 'character_id', 'height', 'mass', 'hair_color', 'skin_color',
       'eye_color', 'birth_year', 'sex', 'gender', 'homeworld', 'species',
       'bmi'],
      dtype='object')

In [21]:
new_order=['name', 'character_id', 'height', 'mass', 'bmi', 'hair_color', 'skin_color',
       'eye_color', 'birth_year', 'sex', 'gender', 'homeworld', 'species']

sw_df = sw_df[new_order]
sw_df

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,31543,172,77.0,26.027582,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,34621,167,75.0,26.892323,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,38280,96,32.0,34.722222,,white,red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,16914,202,136.0,33.330066,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,32723,150,49.0,21.777778,brown,light,brown,19.0,female,feminine,Alderaan,Human
...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,49618,164,50.0,18.590125,brown,light,hazel,,female,feminine,Jakku,Human
83,Poe Dameron,21382,190,70.0,19.390582,brown,light,brown,,male,masculine,Yavin 4,Human
84,BB8,44128,35,,,none,none,black,,none,masculine,,Droid
85,Captain Phasma,38297,225,155.0,30.617284,unknown,unknown,unknown,,unkown,,Parnassos,


In [22]:
# convert height from centimeters to inches (1 inch = 2.54 cm)
sw_df['height'] = sw_df['height'] / 2.54
sw_df

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,31543,67.716535,77.0,26.027582,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,34621,65.748031,75.0,26.892323,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,38280,37.795276,32.0,34.722222,,white,red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,16914,79.527559,136.0,33.330066,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,32723,59.055118,49.0,21.777778,brown,light,brown,19.0,female,feminine,Alderaan,Human
...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,49618,64.566929,50.0,18.590125,brown,light,hazel,,female,feminine,Jakku,Human
83,Poe Dameron,21382,74.803150,70.0,19.390582,brown,light,brown,,male,masculine,Yavin 4,Human
84,BB8,44128,13.779528,,,none,none,black,,none,masculine,,Droid
85,Captain Phasma,38297,88.582677,155.0,30.617284,unknown,unknown,unknown,,unkown,,Parnassos,


<h3>Arrange / Sort Variables</h3>
<p>Sort the observations by the values of one or more variables.</p>

In [23]:
# oldest characters first (largest birth_year)

sw_df.sort_values(by="birth_year", ascending=False)

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
18,Yoda,48746,25.984252,17.0,39.026630,white,green,brown,896.0,male,masculine,,Yoda's species
15,Jabba Desilijic Tiure,37601,68.897638,1358.0,443.428571,,green-tan,orange,600.0,hermaphroditic,masculine,Nal Hutta,Hutt
12,Chewbacca,46714,89.763780,112.0,21.545091,brown,unknown,blue,200.0,male,masculine,Kashyyyk,Wookiee
1,C-3PO,34621,65.748031,75.0,26.892323,,gold,yellow,112.0,none,masculine,Tatooine,Droid
63,Dooku,22740,75.984252,80.0,21.477087,white,fair,brown,102.0,male,masculine,Serenno,Human
...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,Finn,11621,72.440945,65.0,19.198960,black,dark,dark,,male,masculine,,Human
82,Rey,49618,64.566929,50.0,18.590125,brown,light,hazel,,female,feminine,Jakku,Human
83,Poe Dameron,21382,74.803150,70.0,19.390582,brown,light,brown,,male,masculine,Yavin 4,Human
84,BB8,44128,13.779528,,,none,none,black,,none,masculine,,Droid


In [24]:
# sort by skin color (descending), then by hair color (ascending) within each skin color

sw_df.sort_values(by=['skin_color','hair_color'], ascending = [False, True])

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
60,Luminara Unduli,25280,66.929134,56.2,19.446367,black,yellow,blue,58.0,female,feminine,Mirial,Mirialan
61,Barriss Offee,37584,65.354331,50.0,18.144869,black,yellow,blue,40.0,female,feminine,Mirial,Mirialan
3,Darth Vader,16914,79.527559,136.0,33.330066,none,white,yellow,41.9,male,masculine,Tatooine,Human
45,Gasgano,18696,48.031496,57.0,38.296157,none,white,black,,male,masculine,Troiken,Xexto
53,Yarael Poof,14662,103.937008,102.0,14.634986,none,white,yellow,,male,masculine,Quermia,Quermian
...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,Grievous,35569,85.039370,159.0,34.079218,none,brown,"green, yellow",,male,masculine,Kalee,Kaleesh
37,Watto,29001,53.937008,50.0,26.639672,black,blue,yellow,,male,masculine,Toydaria,Toydarian
43,Ayla Secura,30422,70.078740,55.0,17.358919,none,blue,hazel,48.0,female,feminine,Ryloth,Twi'lek
44,Dud Bolt,33496,37.007874,45.0,50.928022,none,blue,yellow,,male,masculine,Vulpter,Vulptereen


<h3>Groupby & Aggregate</h3>
<p>Groupby splits your observations into groups based on the values of one or more variables.</p>
<p>Aggregation combines multiple data values into a single summary statistic, such as the sum, mean, median, minimum, maximum, or standard deviation of each group.</p>

In [25]:
sw_group = sw_df.groupby('sex')['height'].mean()
# optional: append .reset_index().rename(columns={'height':'avg_height'}) to get a dataframe back

sw_group

sex
female            66.510827
hermaphroditic    68.897638
male              70.524934
none              45.341207
unkown            75.688976
Name: height, dtype: float64

In [26]:
sw_group = sw_df.groupby('sex').agg(
    avg_height=('height','mean'),
    count=('sex','count')
).reset_index()

sw_group

Unnamed: 0,sex,avg_height,count
0,female,66.510827,16
1,hermaphroditic,68.897638,1
2,male,70.524934,60
3,none,45.341207,6
4,unkown,75.688976,4
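
When a column has missing values, `count` and `size` give different answers inside `agg()`. The sketch below illustrates the difference on made-up data; the column names `n_mass` and `n_rows` are invented for the example.

```python
import pandas as pd

# Hypothetical data with one missing mass
demo = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "mass": [77.0, None, 49.0, 55.0],
})

summary = demo.groupby("sex").agg(
    avg_mass=("mass", "mean"),   # mean skips missing values
    n_mass=("mass", "count"),    # count = non-missing values only
    n_rows=("mass", "size"),     # size = all rows in the group
).reset_index()

print(summary)
```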


<h3>Recode Variables</h3>
<p>Transform the values of a variable into new values based on specific criteria. With map(), any value that is not listed in the dictionary becomes missing (NaN).</p>

In [27]:
sw_df["gender_num"] = sw_df["gender"].map({
    "masculine": 0,
    "feminine": 1
})

sw_df

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species,gender_num
0,Luke Skywalker,31543,67.716535,77.0,26.027582,blond,fair,blue,19.0,male,masculine,Tatooine,Human,0.0
1,C-3PO,34621,65.748031,75.0,26.892323,,gold,yellow,112.0,none,masculine,Tatooine,Droid,0.0
2,R2-D2,38280,37.795276,32.0,34.722222,,white,red,33.0,none,masculine,Naboo,Droid,0.0
3,Darth Vader,16914,79.527559,136.0,33.330066,none,white,yellow,41.9,male,masculine,Tatooine,Human,0.0
4,Leia Organa,32723,59.055118,49.0,21.777778,brown,light,brown,19.0,female,feminine,Alderaan,Human,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,49618,64.566929,50.0,18.590125,brown,light,hazel,,female,feminine,Jakku,Human,1.0
83,Poe Dameron,21382,74.803150,70.0,19.390582,brown,light,brown,,male,masculine,Yavin 4,Human,0.0
84,BB8,44128,13.779528,,,none,none,black,,none,masculine,,Droid,0.0
85,Captain Phasma,38297,88.582677,155.0,30.617284,unknown,unknown,unknown,,unkown,,Parnassos,,
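
The recoded column above displays `0.0` and `1.0` rather than `0` and `1` because the missing entries force a floating-point column. One possible alternative, sketched here on a made-up Series, is the nullable `Int64` dtype, which keeps whole numbers alongside missing values.

```python
import pandas as pd

# Hypothetical data with one missing label
gender = pd.Series(["masculine", "feminine", None, "masculine"])

# Values absent from the dictionary (including missing ones) become NaN
codes = gender.map({"masculine": 0, "feminine": 1})

# Optional: convert to the nullable integer dtype
codes_int = codes.astype("Int64")

print(codes)
print(codes_int)
```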


<h3>Rename Variables</h3>
<p>Change the name of one or more columns.</p>

In [28]:
sw_df = sw_df.rename(columns={'gender':'gender_label'})
sw_df

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender_label,homeworld,species,gender_num
0,Luke Skywalker,31543,67.716535,77.0,26.027582,blond,fair,blue,19.0,male,masculine,Tatooine,Human,0.0
1,C-3PO,34621,65.748031,75.0,26.892323,,gold,yellow,112.0,none,masculine,Tatooine,Droid,0.0
2,R2-D2,38280,37.795276,32.0,34.722222,,white,red,33.0,none,masculine,Naboo,Droid,0.0
3,Darth Vader,16914,79.527559,136.0,33.330066,none,white,yellow,41.9,male,masculine,Tatooine,Human,0.0
4,Leia Organa,32723,59.055118,49.0,21.777778,brown,light,brown,19.0,female,feminine,Alderaan,Human,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,49618,64.566929,50.0,18.590125,brown,light,hazel,,female,feminine,Jakku,Human,1.0
83,Poe Dameron,21382,74.803150,70.0,19.390582,brown,light,brown,,male,masculine,Yavin 4,Human,0.0
84,BB8,44128,13.779528,,,none,none,black,,none,masculine,,Droid,0.0
85,Captain Phasma,38297,88.582677,155.0,30.617284,unknown,unknown,unknown,,unkown,,Parnassos,,


<h3>Missing Values</h3>

In [29]:
# check for missing data: True marks a missing value

missing_values = sw_df.isna()
missing_values

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender_label,homeworld,species,gender_num
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,False,False,False,False,False,False,False,False,True,False,False,False,False,False
83,False,False,False,False,False,False,False,False,True,False,False,False,False,False
84,False,False,False,True,True,False,False,False,True,False,False,True,False,False
85,False,False,False,False,False,False,False,False,True,False,True,False,True,True


In [30]:
# number of missing values in each variable

number_missing = sw_df.isna().sum()
number_missing

name             0
character_id     0
height           0
mass             7
bmi              7
hair_color       5
skin_color       0
eye_color        0
birth_year      44
sex              0
gender_label     4
homeworld        7
species          4
gender_num       4
dtype: int64
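
Counts of missing values are easier to compare as proportions. The sketch below, on made-up data, takes the mean of the boolean `isna()` result to get the share of missing values per column, and also pulls out the rows that contain at least one missing value.

```python
import pandas as pd

# Hypothetical data with gaps in both columns
demo = pd.DataFrame({
    "mass": [77.0, None, 32.0, None],
    "homeworld": ["Tatooine", "Naboo", None, "Naboo"],
})

# The mean of a boolean column is the fraction of True values
share_missing = demo.isna().mean()

# Rows that contain at least one missing value
rows_with_na = demo[demo.isna().any(axis=1)]

print(share_missing)
print(rows_with_na)
```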

### Drop NAs

In [31]:
sw_df.shape

(87, 14)

In [32]:
sw_df

Unnamed: 0,name,character_id,height,mass,bmi,hair_color,skin_color,eye_color,birth_year,sex,gender_label,homeworld,species,gender_num
0,Luke Skywalker,31543,67.716535,77.0,26.027582,blond,fair,blue,19.0,male,masculine,Tatooine,Human,0.0
1,C-3PO,34621,65.748031,75.0,26.892323,,gold,yellow,112.0,none,masculine,Tatooine,Droid,0.0
2,R2-D2,38280,37.795276,32.0,34.722222,,white,red,33.0,none,masculine,Naboo,Droid,0.0
3,Darth Vader,16914,79.527559,136.0,33.330066,none,white,yellow,41.9,male,masculine,Tatooine,Human,0.0
4,Leia Organa,32723,59.055118,49.0,21.777778,brown,light,brown,19.0,female,feminine,Alderaan,Human,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,49618,64.566929,50.0,18.590125,brown,light,hazel,,female,feminine,Jakku,Human,1.0
83,Poe Dameron,21382,74.803150,70.0,19.390582,brown,light,brown,,male,masculine,Yavin 4,Human,0.0
84,BB8,44128,13.779528,,,none,none,black,,none,masculine,,Droid,0.0
85,Captain Phasma,38297,88.582677,155.0,30.617284,unknown,unknown,unknown,,unkown,,Parnassos,,


In [33]:
sw_dropNA = sw_df.dropna()
print(sw_dropNA.shape)

print(sw_dropNA['bmi'].mean())

(32, 14)
24.134537332587648


In [34]:
sw_dropNA_var = sw_df.dropna(subset=['bmi'])
print(sw_dropNA_var.shape)

print(sw_dropNA_var['bmi'].mean())

(80, 14)
29.42650214774587
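
The two results above differ sharply: dropping every row with any missing value leaves 32 of 87 rows, while restricting the check to `bmi` keeps 80, and the mean BMI changes with it. The sketch below shows these choices side by side on made-up data, along with `how="all"`, which is not used above.

```python
import pandas as pd

# Hypothetical data with gaps in both columns
demo = pd.DataFrame({
    "mass": [77.0, None, 32.0, None],
    "birth_year": [19.0, 112.0, None, None],
})

any_missing = demo.dropna()                 # drop rows with any missing value
all_missing = demo.dropna(how="all")        # drop rows only if every value is missing
mass_only = demo.dropna(subset=["mass"])    # check a single column only

print(len(any_missing), len(all_missing), len(mass_only))
```

Restricting `dropna` to the columns an analysis actually uses usually keeps more observations.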


### Replace NAs

In [35]:
# replace missing gender labels with "unknown"
sw_df["gender_label"] = sw_df["gender_label"].fillna("unknown")

In [36]:
#get value counts for a variable

sw_df["gender_label"].value_counts()

gender_label
masculine    66
feminine     17
unknown       4
Name: count, dtype: int64
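
Numeric variables can be filled in a similar way. The sketch below fills missing values with the column median; this is only one common choice, shown on made-up data, and whether it is appropriate depends on the analysis.

```python
import pandas as pd

# Hypothetical data with one missing mass
demo = pd.DataFrame({"mass": [77.0, None, 32.0, 136.0]})

# The median is computed from the non-missing values
median_mass = demo["mass"].median()

# Writing to a new column keeps the original values for comparison
demo["mass_filled"] = demo["mass"].fillna(median_mass)

print(demo)
```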

### Export cleaned data

In [37]:
# index=False prevents pandas from writing the row index as an extra column
sw_df.to_csv("star_wars_cleaned.csv", index=False)
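
A CSV file does not store data types, so a text column such as `character_id` would be inferred as an integer when the file is read back. The sketch below shows one way to preserve it, using an in-memory buffer in place of a file on disk; the names `demo`, `buffer`, and `restored` are made up for the example.

```python
import io
import pandas as pd

# Hypothetical data with the ID stored as text
demo = pd.DataFrame({"name": ["Luke", "Leia"], "character_id": ["31543", "32723"]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# dtype tells read_csv to keep character_id as text
restored = pd.read_csv(buffer, dtype={"character_id": str})

print(restored)
```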