In [1]:
import pandas as pd
import numpy as np

## Data Types

Lists and similar structures can be checked with either "type()" or "isinstance()". "type()" tells you what data type you are dealing with, while "isinstance()" returns True or False depending on whether the object you supply is an instance of the type you supply.

We'll move through some examples quickly and then move on to dataframes and their columns.

First, even though you might have a list whose elements are all the same type (homogeneous), Python does not tell you what type the elements are. If you supply the list to "type()" it will simply tell you - not surprisingly - that it is a list!

In [11]:
list1 = [1,2,3,4,5]
type(list1)

list

To find the data type of a specific element, you need to reference it by index. In this case, since they are all integers, any index should return type "int".

In [14]:
type(list1[1])

int

But lists are flexible, so what if the list is not homogeneous? Well, you'd expect the same result for the list itself. The individual elements, however, would have different types depending on the index.

In [15]:
list2 = ["Tammy", 2]
type(list2)

list

In [16]:
type(list2[0])

str

In [17]:
type(list2[1])

int
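The cells above only use "type()". As a small sketch of the other function mentioned at the start of this section, "isinstance()" answers the yes/no version of the same question. The list below simply repeats list2 under a different (hypothetical) name so the cell runs on its own.

```python
# Same mixed list as list2 above, repeated so this cell is self-contained
mixed = ["Tammy", 2]

# isinstance() takes the object first and the type (or a tuple of types) second
first_is_str = isinstance(mixed[0], str)
second_is_str = isinstance(mixed[1], str)
second_is_number = isinstance(mixed[1], (int, float))
whole_is_list = isinstance(mixed, list)

print(first_is_str, second_is_str, second_is_number, whole_is_list)
```

Passing a tuple of types, as in the third check, is handy when several types are acceptable.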

For longer lists, you could write a loop to print out the type of every element in the list.
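A minimal sketch of such a loop follows. The list "items" is made up for illustration and is not part of the example data.

```python
# A hypothetical longer list with mixed element types
items = [1, "two", 3.0, None, [5]]

# Walk the list and report the type name of each element by position
for index, item in enumerate(items):
    print(index, type(item).__name__)

# The same information collected into a list for later use
type_names = [type(item).__name__ for item in items]
```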

For dataframes, you'll usually work with entire columns as a minimum. The three data types you will most often see pandas assign by default are int, float and object. The first two are self-explanatory. The third, object, is usually either a string or a mix of elements. It's usually good to check these columns and cast them to something more specific.

Let's create a small example dataframe to try some things out. We'll pretend that we've pulled some data on one specific quest. It includes the players' names, if they started or completed the quest, if they abandoned the quest and their game co-ordinates.

In [56]:
import numpy as np

# creating a dataframe
df1 = pd.DataFrame({
    "Name": ["Tammy", "John", "Katherine", "Tom"],
    "Status": ["Complete", "Started", 2, "Complete"],
    "Abandoned": [False, True, np.nan, False],
    "Location": [90.33, 30.6, np.nan, 70.22],
})

df1

Unnamed: 0,Name,Status,Abandoned,Location
0,Tammy,Complete,False,90.33
1,John,Started,True,30.6
2,Katherine,2,,
3,Tom,Complete,False,70.22


So, the data on Tammy and Tom seem to make sense. They both completed the quest - and they didn't abandon it - and they have a location. John's data is logical too. He started the quest, but abandoned it. He also has a location. Katherine's data is troubling. First we have a "2" as an entry under Status. It's nonsensical. Maybe related to this is the fact that we cannot tell if she abandoned the quest and we cannot tell where she is.

Let's see what this means for data types.

In [57]:
df1.dtypes

Name          object
Status        object
Abandoned     object
Location     float64
dtype: object

The Name column is "object". Remember, even if it is nice and clean, string data will default to "object". The same applies to the Status column. Abandoned holds Boolean values, but because of the missing value pandas stores it as "object" rather than as a Boolean type. Lastly, the Location column is a float type and, despite the "NaN", pandas defaulted it to float64.

Let's use a built-in function to try and correct these shortcomings.

In [58]:
df1.convert_dtypes().dtypes


Name          string
Status        object
Abandoned    boolean
Location     Float64
dtype: object

Now, pandas has recognized the Name column as string, the Abandoned column as Boolean. However, the Status column remains "object". Why? Because of that odd "2" value. The column is mixed - an integer with strings. A little more work is needed.
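One way to confirm that a column is mixed is to look at the Python type of each value. The sketch below rebuilds the example dataframe under the hypothetical name "df_check" so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe so this cell runs on its own
df_check = pd.DataFrame({
    "Name": ["Tammy", "John", "Katherine", "Tom"],
    "Status": ["Complete", "Started", 2, "Complete"],
    "Abandoned": [False, True, np.nan, False],
    "Location": [90.33, 30.6, np.nan, 70.22],
})

# Map every value in the column to the name of its Python type, then count them
status_types = df_check["Status"].map(lambda value: type(value).__name__)
type_counts = status_types.value_counts()
print(type_counts)

# Rows whose Status is not a string are the ones that need attention
odd_rows = df_check[status_types != "str"]
print(odd_rows)
```

More than one type name in the counts means the column is mixed, and the filter shows which rows are responsible.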

## Missing, Erroneous Data and Imputation

At this point, though this is a trivial example of just four rows of data, we can still demonstrate the concepts of dealing with missing and erroneous data. Often the choice you make will be heavily dependent on the context in which you find yourself.

In [59]:
df1

Unnamed: 0,Name,Status,Abandoned,Location
0,Tammy,Complete,False,90.33
1,John,Started,True,30.6
2,Katherine,2,,
3,Tom,Complete,False,70.22


First, we've mentioned the odd 2 value in Status. Python does not coerce data as it is input, so the column remains in the somewhat unspecific "object" category. One way to deal with this is to manually change the column to a string type, making the 2 a '2', and then have pandas replace the value with an acceptable string.

For our example, let's say Status at one time was categorized as 1 or 2 - Started or Complete respectively. We'll make the change. Then, we'll make the column a string.

In [60]:
df1.loc[df1["Status"]==2,"Status"]="Complete"
df1

Unnamed: 0,Name,Status,Abandoned,Location
0,Tammy,Complete,False,90.33
1,John,Started,True,30.6
2,Katherine,Complete,,
3,Tom,Complete,False,70.22


You can see that the Status column in Katherine's row now reflects "Complete".
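The text mentions making the column a string, which the cell above does not do. A possible sketch of that last step is shown here on a stand-alone copy of the column. The names "status", "legacy_codes" and "status_clean" are illustrative, and the 1/2 coding is the assumption stated above.

```python
import pandas as pd

# Stand-alone copy of the Status column as it looks before the fix
status = pd.Series(["Complete", "Started", 2, "Complete"], name="Status")

# Assumed legacy coding from the text: 1 = Started, 2 = Complete
legacy_codes = {1: "Started", 2: "Complete"}

# Swap any legacy code for its label and leave other values untouched,
# then cast the result to pandas' dedicated string dtype
status_clean = status.map(lambda value: legacy_codes.get(value, value)).astype("string")

print(status_clean.dtype)
print(status_clean.tolist())
```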

Next, we have two missing values - one in the Abandoned column and another in the Location column. Here we should consider imputation. The two represent different issues. Remember, the solution you choose will be heavily reliant on the context.

For the Abandoned problem, imputation is tricky. Of all imputation methods, which makes logical sense? In this case, if we did a little digging, we might find that if a Quest is Complete, then logically the player did not abandon it. Again, this might not be considered true imputation, but if not, we'll bend the rules here to show a little more of pandas functionality. Let's change Abandoned to "False" based on the Status column.

In [61]:
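# assign the Boolean False, not the string "False", so the column can later be cast to a Boolean type
df1.loc[df1["Status"]=="Complete","Abandoned"]=False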
df1

Unnamed: 0,Name,Status,Abandoned,Location
0,Tammy,Complete,False,90.33
1,John,Started,True,30.6
2,Katherine,Complete,False,
3,Tom,Complete,False,70.22


Success! Now, with the Location column, we have values where the normal imputation methods make more sense. We're going to make a copy of our little dataframe so we can try two methods. First, in the copy, we'll just drop the row. Then, back in the original, we'll use the mean of the location values to impute the player's location. Again, for location, that is fairly arbitrary - just keep in mind that we're only doing it this way to demonstrate the functionality. In reality, there is probably a set location to be used in these cases, in which case you'd take an approach more like the one we used for the Abandoned column.

In [62]:
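# copy the dataframe (.copy() creates an independent object; plain assignment would only add a second name for df1)
df2 = df1.copy()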
df2

Unnamed: 0,Name,Status,Abandoned,Location
0,Tammy,Complete,False,90.33
1,John,Started,True,30.6
2,Katherine,Complete,False,
3,Tom,Complete,False,70.22


We're going to drop that entire row (Katherine). You should always check to see if doing so would delete too many rows of your data. There's a quick way to "eyeball" this. In our little example, we will be dropping 1 of 4 rows - 25% - but retaining 75%.

In [63]:
# compare the number of rows before and after dropping rows that contain NaN
len(df2), len(df2.dropna())

(4, 3)

We have 4 rows, dropping down to 3 after dropping the NaNs.
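Beyond comparing lengths, it can help to see where the missing values sit and what share of rows would be lost. The following is a sketch on a hypothetical stand-in ("df_missing") for the dataframe at this point in the walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataframe at this point in the walkthrough
df_missing = pd.DataFrame({
    "Name": ["Tammy", "John", "Katherine", "Tom"],
    "Status": ["Complete", "Started", "Complete", "Complete"],
    "Abandoned": [False, True, False, False],
    "Location": [90.33, 30.6, np.nan, 70.22],
})

# Number of missing values in each column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Share of rows that contain at least one missing value
share_rows_lost = df_missing.isna().any(axis=1).mean()
print(f"{share_rows_lost:.0%} of rows would be dropped")
```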

In [64]:
df2 = df2.dropna()
df2

Unnamed: 0,Name,Status,Abandoned,Location
0,Tammy,Complete,False,90.33
1,John,Started,True,30.6
3,Tom,Complete,False,70.22


We've now gotten rid of the missing values.

Let's return to df1 to use a method where we keep that row using imputation and the column's mean value.

In [65]:
location_mean = df1["Location"].mean()
location_mean

63.71666666666667

So, the mean is 63.72. We'll replace the NaN with that value.

Note two things. First, we assign the result of "fillna()" back to the column instead of calling it with "inplace=True" on the selected column. Recent pandas versions warn about that pattern, and it may leave the original dataframe unchanged. Second, we could also format this better. The mean has more decimal places than make sense for a location.

In [66]:
df1["Location"] = df1["Location"].fillna(location_mean)
df1

Unnamed: 0,Name,Status,Abandoned,Location
0,Tammy,Complete,False,90.33
1,John,Started,True,30.6
2,Katherine,Complete,False,63.716667
3,Tom,Complete,False,70.22
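As a sketch of the formatting point made above, the mean can be rounded to two decimals before it is used, which matches the precision of the other location values. The Series below is a stand-alone copy of the Location column, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Location values from the example, with one missing entry
location = pd.Series([90.33, 30.6, np.nan, 70.22], name="Location")

# mean() skips NaN by default; round to match the two decimals in the data
location_mean_rounded = round(location.mean(), 2)

# fillna() returns a new Series, so assign the result to keep it
location_filled = location.fillna(location_mean_rounded)
print(location_filled.tolist())
```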


## Scaling

We'll move to a larger dataset in order to demonstrate the usefulness of scaling.

In [108]:
import random
np.random.seed(42)

dfrand1 = pd.DataFrame(np.random.randint(18, 100, size=1000), columns=["Age"])
dfrand1["Score"] = np.random.randint(0, 1000, size=1000)
dfrand1["Gender"] = random.choices(["Male", "Female"], k=len(dfrand1))
dfrand1

Unnamed: 0,Age,Score,Gender
0,69,459,Female
1,32,355,Male
2,89,323,Male
3,78,132,Male
4,38,887,Female
...,...,...,...
995,30,250,Male
996,79,382,Male
997,99,6,Male
998,77,458,Female


Now, we have 1000 rows with two numeric columns. The Age column runs from 18 to 99 and the Score column from 0 to 999, because the upper bound passed to "np.random.randint" is exclusive. Lastly, we added a Gender column which we will use presently.

Let's start with the Age column. We will apply Min-Max scaling (normalization).

In [113]:
dfrand1["Age"] = (dfrand1["Age"] - dfrand1["Age"].min()) / (dfrand1["Age"].max() - dfrand1["Age"].min())
dfrand1

Unnamed: 0,Age,Score,Gender
0,0.629630,459,Female
1,0.172840,355,Male
2,0.876543,323,Male
3,0.740741,132,Male
4,0.246914,887,Female
...,...,...,...
995,0.148148,250,Male
996,0.753086,382,Male
997,1.000000,6,Male
998,0.728395,458,Female


Compare the Age column now to what it looked like previously.
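The calculation in the cell above is x' = (x - min) / (max - min). One way to make it reusable is to wrap it in a small function. "min_max_scale" below is a hypothetical helper, not a pandas function, and the handling of a constant column is one possible choice among several.

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series to the range [0, 1]."""
    low, high = values.min(), values.max()
    if high == low:
        # A constant column has no spread; return zeros instead of dividing by zero
        return pd.Series(0.0, index=values.index, name=values.name)
    return (values - low) / (high - low)

# A few ages in the same range as the example data
ages = pd.Series([18, 32, 69, 99], name="Age")
scaled_ages = min_max_scale(ages)
print(scaled_ages.tolist())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.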

For the Score column, we will use the z-score approach (standardization).

In [115]:
dfrand1["Score"] = (dfrand1["Score"] - dfrand1["Score"].mean()) / dfrand1["Score"].std()
dfrand1

Unnamed: 0,Age,Score,Gender
0,0.629630,-0.135818,Female
1,0.172840,-0.484914,Male
2,0.876543,-0.592328,Male
3,0.740741,-1.233456,Male
4,0.246914,1.300845,Female
...,...,...,...
995,0.148148,-0.837367,Male
996,0.753086,-0.394284,Male
997,1.000000,-1.656399,Male
998,0.728395,-0.139175,Female


This is a different sort of result - but just as useful. Later, you'll see that you do not have to implement these manually as we just did. scikit-learn provides "MinMaxScaler", which corresponds to what we did for the Age column, and "StandardScaler", which corresponds to what we did for the Score column.

Lastly, in some cases, you cannot use the Gender column in its current format, for example when a model only accepts numeric input. You'll need one-hot encoding. Let's see what happens when we do this.

In [117]:
# dtype=int requests 0/1 columns; newer pandas versions return True/False by default
y = pd.get_dummies(dfrand1["Gender"], prefix="Gender", dtype=int)
y.head()

Unnamed: 0,Gender_Female,Gender_Male
0,1,0
1,0,1
2,0,1
3,0,1
4,1,0


Now that you've encoded them, you can add these as columns in your original dataframe. Usually, at that point, you'd also drop the original Gender column.

In [123]:
dfrand2 = pd.concat([dfrand1, y], axis=1)
dfrand2 = dfrand2.drop(["Gender"], axis=1)
dfrand2

Unnamed: 0,Age,Score,Gender_Female,Gender_Male
0,0.629630,-0.135818,1,0
1,0.172840,-0.484914,0,1
2,0.876543,-0.592328,0,1
3,0.740741,-1.233456,0,1
4,0.246914,1.300845,1,0
...,...,...,...,...
995,0.148148,-0.837367,0,1
996,0.753086,-0.394284,0,1
997,1.000000,-1.656399,0,1
998,0.728395,-0.139175,1,0
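The encode, concatenate and drop steps above can also be done in a single call by passing the whole dataframe to "pd.get_dummies" with the "columns" argument. This is a sketch on a tiny made-up dataframe ("df_small") with the same column names.

```python
import pandas as pd

# Tiny hypothetical frame with the same columns as dfrand1
df_small = pd.DataFrame({
    "Age": [0.63, 0.17, 0.88],
    "Score": [-0.14, -0.48, -0.59],
    "Gender": ["Female", "Male", "Male"],
})

# columns= encodes the named column and drops the original in one step;
# dtype=int asks for 0/1 rather than booleans
df_encoded = pd.get_dummies(df_small, columns=["Gender"], dtype=int)
print(df_encoded)
```

The column name is used as the prefix by default, so the new columns are again called Gender_Female and Gender_Male.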


There are numerous other methods for normalizing, standardizing and encoding. 