# Introducing Data as Data: The Series 

Whenever we talk about data in this book, we mean regular or systematic measurements of phenomena. Anything can serve as data if you are keen to observe it and can measure it in some reliable way. Social scientists have come up with all kinds of creative ways to measure phenomena. For example, to measure bias in hiring, Bertrand and Mullainathan used callbacks to resumes that were nearly identical except for the use of racially or ethnically distinctive names (xx). To measure political engagement during a protest, Gonzalez-Bailon et al. measured the frequency of tweets mentioning keywords (xx). Whatever we measure, we will want to consider the unit of analysis. And if we have a single unit, we usually have multiples of that unit. We compare tweets, edits to Wikipedia, clicks on a page, times a light was turned on, etc. We can think of each unit as an object. What we first want to do is create a collection of objects.

There are many different kinds of collections in Python. Different collections have different features and syntax. So one kind of collection, such as a list, might be _indexed_, which means that you can retrieve members of the collection with a sequential index. In Python the first index is zero. The list ```ll = ["alpha", "bravo", "charlie"]``` is indexed such that ```ll[0]``` returns ```"alpha"```. Other collections are _keyed_ meaning that they use key-value pairs. The keys don't need to be in any specific order. A dictionary ```dd = {"sun":"warm", "cloud":"cool"}``` will be keyed such that ```dd["cloud"]``` will return ```"cool"```. 
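
As a minimal sketch of the two access styles, the snippet below reuses the ```ll``` and ```dd``` collections from the paragraph above. The variable names ```first_item``` and ```cloud_value``` are only there for illustration.

~~~ python
ll = ["alpha", "bravo", "charlie"]
dd = {"sun": "warm", "cloud": "cool"}

first_item = ll[0]          # indexed: position 0 is the first element
cloud_value = dd["cloud"]   # keyed: look up the value stored under "cloud"

print(first_item)
print(cloud_value)
~~~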

As we progress through the book, you will encounter increasingly complex combinations of indexed and keyed collections. In fact, a very large part of data science programming is knowing how to effectively use the right kind of collection for a task. 

A simple example of a collection could be a range of numbers, say from 0 to 9. We would literally use a ```range``` object, such as ```range(10)``` for the first ten integers starting with $0$, or ```range(1,20,4)``` for numbers starting from $1$ and stopping before $20$, going $4$ at a time. A more complex example of a collection could be a comment tree for a _reddit_ post. This comment tree will be ordered in a different way from a range of numbers. With a module called ```praw```, which we will discuss in depth later, you can download the top-level comments for a reddit post, then capture all the comments underneath, and so forth, while preserving the comment structure.
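
To see what those two ```range``` objects contain, one option is to convert each to a list. This is just a quick check; note that a range stops before its end value.

~~~ python
first_ten = list(range(10))       # 0 up to, but not including, 10
stepped = list(range(1, 20, 4))   # start at 1, stop before 20, step by 4

print(first_ten)
print(stepped)
~~~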

When we want to learn about a phenomenon, we usually need to transform data from one structure, where we collect or measure something, into another structure that allows us to gather insights and make claims. For example, we might take a set of survey responses and transform them into a table so we can get a sense of how strongly people feel about an issue, or learn which subsets of people feel more or less strongly about the issue. We might have a set of speeches as text files, but we will want to make claims about frequencies of words or features of the speeches, like how many different words were used. That means calculating some measures, such as lexical diversity, and putting them in a table for comparison.

The process of transforming data is called __"data wrangling"__ and it is the most pervasive part of data science. Some researchers will use very sophisticated machine learning on texts, others will use data visualisations to discover and communicate insights, while others will just merge and filter data to make comparisons in tables. But everybody will wrangle data.

When wrangling, lists and dictionaries are alright, but they lack some useful features. For example, wouldn't it be nice to have a collection that both has an index, so that you can count to the $i^{th}$ element, and is keyed, so that you can just ask for element $i$ by name? In Python, such a collection is part of the ```pandas``` library. In fact, there are two such collections that can be indexed and keyed, and we will be seeing a lot of them. The first is the ```Series```. The ```Series``` has an index that you can set and it has an order from $0$ to $n-1$ for $n$ elements. Importantly, a ```Series``` has only one dimension. It is like a single list. Typically, we want at least two dimensions, like we would have in a spreadsheet: not just a list of case IDs, but for each case ID the age, location, number of followers, frequency of edits, date of last login, etc. That is the job of the second collection, the ```DataFrame```, where we treat each case as a row and each feature we want to measure as a column.

Below we will first introduce the features of the Series and then we will introduce the DataFrame. Then in the next chapter we will introduce some common data structures found on the web and show how to transform them into DataFrames. Then in every chapter that follows we will use DataFrames in some fashion to pose and answer questions about data. 

Before we get started you might be asking, can _everything_ be done in these dataframes? No, not everything that we wish to do with programming is best done in DataFrames. Later on throughout the book, we will be using different kinds of objects when they are fit for purpose. Despite this, when we want to make a scientific claim, we will want to extract data from these different kinds of objects and usually wrangle it into a DataFrame to make some sort of comparison. DataFrames enable us to use statistical measures to pose questions like "do we see more or less of something than we would expect", "does something change over time", or "what different things seem to consistently go together"? So although not everything can be done in DataFrames, most of what you will want to do will involve them at some point. 

Before we get to the DataFrame, however, it makes sense to start with the Series. Then we will see how DataFrames are really just collections of Series objects. Consequently, to effectively get data into a DataFrame, to get data out a DataFrame, to filter that data, or to merge it with another table of data, you will often need to use a Series. 

# The Series 

The Series is like a list but the index can be labeled and you can give the Series a name. A Series is a class of object in Python within the ```pandas``` library. Therefore you can import and then create an empty Series in two ways: 

~~~ python 
from pandas import Series 
ser1 = Series()
~~~
or
~~~ python
import pandas as pd
ser1 = pd.Series()
~~~

The former is best when the Series is the only thing you want to import from pandas. However, in most cases we want to import the Series, the DataFrame and maybe some helper methods, so in my code I tend to use the second approach. In case you haven't seen this before, ```as``` is a way to give a library a different, typically shorter name. So in this book you will see it used here, in ```import pandas as pd```, and later, for example, in ```import numpy as np```.

The empty series with a default index is not very useful on its own. So we will instead create a Series with some data. Let's start with the days of the week. A list of these days would be:

~~~ python
lweekdays = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
~~~

To turn this into a series we could write: 

~~~ python 
sweekdays = pd.Series(lweekdays,name="Weekdays")

display(sweekdays)

> 0       Monday
> 1      Tuesday
> 2    Wednesday
> 3     Thursday
> 4       Friday
> 5     Saturday
> 6       Sunday
~~~

This will transform the list into a Series with the name we supplied (```"Weekdays"```; without the ```name``` argument it would be the default, ```None```) and the default index (in this case 0 through 6, since we have 7 elements).

However, now imagine that instead of simply listing the days of the week, we want a list where we have some _measurements_, based on days. We could count something per day, like the "hours of sleep that night". So we could use a dictionary with the keys as days and the number of hours slept that night as the values. 

~~~ python 
dsleephours =  {"Sunday":8,
                "Monday":7,
                "Tuesday":5,
                "Wednesday":6,
                "Thursday":8,
                "Friday":9,
                "Saturday":8}
~~~

Now we can create a Series from that dictionary, much in the way we did with a list. Except this time, the keys will be the indices rather than the numbers 0 through 6. Observe: 

~~~ python 
sleephours = pd.Series(dsleephours)

display(sleephours) 

> Sunday       8
> Monday       7
> Tuesday      5
> Wednesday    6
> Thursday     8
> Friday       9
> Saturday     8
~~~

With this series we can now start to explore data. There are three ways in which we tend to extract data from a Series: by index, by value, as a distribution. See a description of each below. 

## Working from index

Working from index means that we will start with an index and get a value. For example, we might want to know if we got 8 hours of sleep on Tuesday. To work from index means we get data by querying the series based on the index. Observe:

~~~ python 
display(sleephours["Tuesday"])
> 5

# OR, by position (Tuesday is the third entry, so position 2)

display(sleephours.iloc[2])
> 5
~~~ 

It seems that Tuesday night was a rough night, as the data shows only 5 hours of sleep. 
 
What happens if the index itself consists of integers? In that case, an integer inside the square brackets is treated as a label rather than as a position. See the notebook for examples of this gotcha. As a rule, try to avoid using integers as indices unless they are sequential and start from 0. If you absolutely must use a number, try using a string version: "0" is a string, whereas $0$ is a number.
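
A small sketch of this gotcha, using a made-up Series whose integer index does not start at 0. The explicit ```.loc``` (by label) and ```.iloc``` (by position) accessors make it clear which kind of lookup is intended.

~~~ python
import pandas as pd

# Hypothetical example: an integer index that does not start at 0.
scores = pd.Series(["low", "mid", "high"], index=[10, 20, 30])

by_label = scores.loc[20]      # label lookup: the entry whose index is 20
by_position = scores.iloc[0]   # positional lookup: the first entry
print(by_label, by_position)

# With the index stored as strings, a label cannot be mistaken for a position.
scores.index = scores.index.astype(str)
by_string_label = scores.loc["20"]
print(by_string_label)
~~~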
 

## Working from values (and Slicing)
Working from values means that we will start with a value or set of values and discover the related indices. This is typically how we slice and filter data. For example, we might want to filter a series of Twitter accounts down to those who have been reported as bots. In this case, we would have a series with the Twitter account name as the index and the value of ```True``` or ```False``` for ```is_bot``` as the values of the Series. 

In our case, we have a Series with hours of sleep as the values. We could then ask which nights had more than 7 hours of sleep. By using a comparison operator (which evaluates to ```True``` or ```False```) we get a new Series with the result of that query for each night. It has the same indices, but the values are now ```True``` or ```False``` according to whether the value is greater than 7.

~~~ python 
display(sleephours > 7)

> Sunday        True
> Monday       False
> Tuesday      False
> Wednesday    False
> Thursday      True
> Friday        True
> Saturday      True
~~~

This Series shows that on Monday, Tuesday, and Wednesday we observed 7 or fewer hours of sleep.

Where this sort of Boolean logic is most useful is in slicing a Series or a DataFrame (it's the same principle in both, but we will go over this again with DataFrames).  

So we discovered that we can query by index with ```Series[index]```, and that we can create a new ```True```/```False``` Series with a Boolean comparison. What if we put the Boolean comparison inside the query? Then we get a _slice_. So now instead of just asking whether each day had 8 hours of sleep or more, we can query for _which_ days had 8 or more hours. What we are doing here is ```SERIES[SERIES >= VALUE]```, or more generally ```SERIES[TRUTH_CONDITION]```, to filter down the original Series. Observe:

~~~ python 
display(sleephours[sleephours >= 8])
 
> Sunday      8
> Thursday    8
> Friday      9
> Saturday    8
> dtype: int64
~~~
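
The same pattern covers the bot example mentioned at the start of this section. The sketch below uses invented account names and labels. Because the Series already holds ```True```/```False``` values, it can serve as its own truth condition.

~~~ python
import pandas as pd

# Hypothetical account names and bot labels, invented for this sketch.
is_bot = pd.Series({"@account_a": False, "@account_b": True, "@account_c": True},
                   name="is_bot")

bots = is_bot[is_bot]            # keep only the entries whose value is True
bot_accounts = list(bots.index)  # the index holds the account names
print(bot_accounts)
~~~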

Building up this chain further, we can ask how many days are in this new slice with ```len()``` (it's short for length). So if we want to know what proportion of days we observed the subject having 8 or more hours of sleep it could look like this: 

~~~ python 
days_sleep = len(sleephours[sleephours >= 8])
total_days = len(sleephours)
display(days_sleep / total_days) 
> 0.5714285714285714
~~~

- NOTE: I really don't recommend reporting the full number as displayed above in research. Instead, represent the number at a meaningful precision. In this case, perhaps $0.57$ would be useful. In Chapter xx we will look at how to render numbers for display as part of presenting data.
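
As a rough sketch of one way to get that shorter number, Python's built-in ```round()``` can trim the proportion to two decimal places. The example also uses a shortcut: taking the ```mean()``` of a ```True```/```False``` Series gives the proportion of ```True``` values, since ```True``` counts as 1 and ```False``` as 0.

~~~ python
import pandas as pd

sleephours = pd.Series({"Sunday": 8, "Monday": 7, "Tuesday": 5, "Wednesday": 6,
                        "Thursday": 8, "Friday": 9, "Saturday": 8})

proportion = (sleephours >= 8).mean()   # share of nights with 8 or more hours
rounded = round(proportion, 2)          # keep two decimal places for reporting
print(rounded)
~~~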

## Working from distributions 
Working from distributions means that we will try to summarize the values in some way. A key distinction here is the type of data and, in particular, whether the data is numerical or not. If the data in the Series is numerical, we can produce numerous statistical summaries, such as the mean, median, mode and skewness. If the data is non-numerical, most of what we can do with a distribution is get the max, min and mode, and use a command to create a table of values.

With the original data on hours of sleep per night we might want to get a sense of how many days we had 5,6,7,8 or more hours of sleep. Alternatively, we may want to summarise the number of days with > 7 hours. To summarise a series by counting the number of unique entries we would use the ```value_counts()``` method. 

~~~ python
display(sleephours.value_counts())

> 8    3
> 7    1
> 6    1
> 5    1
> 9    1
    
(sleephours > 7).value_counts()

> True     4
> False    3
~~~

Notice that in the second case, we used the table of boolean values for ``` sleephours > 7``` and then summarised this in a ```value_counts()``` table. Beyond value counts are a huge number of possible statistical routines. One obvious one would be ```mean()``` (often also called the average). 

~~~ python 
display(sleephours.mean())
> 7.285714285714286

display(sleephours.max())
> 9
~~~

There are many statistical routines available for the series. We will explore these in more depth in Chapter xx, where we look at exploratory data. We will also look at these in depth again in Chapter xx where we visualise data. 
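
To illustrate the numerical versus non-numerical distinction made at the start of this section, here is a brief sketch. The sleep data is the same as above; the ```species``` Series is a small made-up example of text data.

~~~ python
import pandas as pd

sleephours = pd.Series({"Sunday": 8, "Monday": 7, "Tuesday": 5, "Wednesday": 6,
                        "Thursday": 8, "Friday": 9, "Saturday": 8})

# Numerical data: statistical summaries are available.
median_hours = sleephours.median()
modal_hours = sleephours.mode()    # a Series, because several values can tie
skew_hours = sleephours.skew()

# Non-numerical data: min and max fall back to alphabetical order.
species = pd.Series(["Frog", "Pig", "Bear", "Frog"])
first_alpha = species.min()
last_alpha = species.max()
modal_species = species.mode()

print(median_hours, list(modal_hours), skew_hours)
print(first_alpha, last_alpha, list(modal_species))
~~~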

Tip. In Python, you can use the ```dir()``` function (short for directory) to display all of the attributes and methods that an object has. Some of these will be internal, system commands. They are prefixed by ```__``` and should not be referenced directly. The rest are meant to be used. By using ```dir()``` we can see the difference in what we can do with a Series versus a list.

In the interest of saving space (and paper) we will not list off the methods here, but instead we will count them. Observe: 

~~~ python 
ex_list = [] # Just an empty list
ex_series = pd.Series(ex_list) # Now an empty series

display( len( dir(ex_list))) # Number of attributes and methods on a list
> 46

display( len( dir(ex_series)))
> 458
~~~

It appears there are almost ten times as many methods for a Series as for a list! Many of these will be useful for describing, shaping and analyzing data. 
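
If you want to browse only the names that are meant to be used, one possible approach is to filter out anything starting with an underscore. This is a sketch; the exact number of names will vary with your version of pandas.

~~~ python
import pandas as pd

ex_series = pd.Series([], dtype=object)   # an empty Series

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(ex_series) if not name.startswith("_")]

print(len(public_names))
print("value_counts" in public_names)
~~~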

## Adding data to a Series 

In many ways a Series works like a list, but one key difference is that a list automatically indexes values by position. Thus, if you have a five-item list, ```ldemo```, then ```ldemo.append("TEXT")``` will append "TEXT" to the list and treat it as the sixth item in order.

A Series has no equivalent method that accepts a bare value. This is because a Series expects an index and data, not just data. Only another Series (or a DataFrame) can be joined to a Series, since these have explicit indices, whereas in a list the index is implicit and based solely on position. Older versions of pandas offered a ```Series.append()``` method for this, which threw a ```TypeError``` if you passed it a plain value; it was removed in pandas 2.0 in favour of ```pd.concat()```.

This leads to two different strategies for adding data to a Series. The first is to create the new data as a primitive data type (such as a list or dictionary), convert it to a Series, and concatenate it with the original. The second uses the index to add new values one at a time. In this latter case, you have to stipulate the index of the new value. Be careful: you can also overwrite values of a Series this way. Observe both of these strategies:

~~~ python 
# Convert a list to a Series and concatenate it with an existing Series.
# Step 1. Create Series 1. 
ldemo1 = ["Kermit","Piggy","Fozzie"]
sdemo1 = pd.Series(ldemo1) 

# Step 2. Create Series 2. 
ldemo2 = ["Animal","Janice", "Dr. Teeth"]
sdemo2 = pd.Series(ldemo2) 

# Step 3. Concatenate Series 2 onto Series 1. 
# Notice the 'ignore_index' argument. 
# Try running this without that argument (the index will repeat 0, 1, 2).
sdemo1 = pd.concat([sdemo1, sdemo2], ignore_index=True)
display(sdemo1)
~~~

In the above, we first created a second Series and then concatenated it with the first. Note that we wrote ```sdemo1 = pd.concat(...)```. This is because ```pd.concat()``` does not add the data to the original Series. Instead, it creates a new Series that combines the two earlier Series. If you don't assign that new Series to a variable, it will disappear once it has been created.

In the example below, we will first assume that the index is sequential. Then we will add the elements one at a time, with their index being one number higher than the highest current index value. How do we know what's the highest value? Since ```len()``` gives the length of the series, and the series starts at zero, then whatever length is will be the next number in the sequence. 

~~~ python 

ldemo1 = ["Kermit","Piggy","Fozzie"]
sdemo1 = pd.Series(ldemo1) 

# The second way, let's append the data one new index at a time.
ldemo2 = ["Animal","Janice", "Dr. Teeth"]

for i in ldemo2: 
    sdemo1[len(sdemo1)] = i
display(sdemo1)
~~~

This code might seem a little more straightforward, but it is not recommended for large tasks. The way that Series are stored means that you are actually creating a new Series with a new index every time you add a single value. With three new elements this makes virtually no difference, but with hundreds of thousands of data points, continually creating a new Series on every pass through the loop will slow down code unnecessarily.
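
One common way around this, sketched here under the assumption that the default sequential index is acceptable, is to gather values in an ordinary list and only build the Series once at the end. The name ```collected``` is just for illustration.

~~~ python
import pandas as pd

sdemo1 = pd.Series(["Kermit", "Piggy", "Fozzie"])
ldemo2 = ["Animal", "Janice", "Dr. Teeth"]

# Gather the values in an ordinary list first (cheap, even for many items).
collected = list(sdemo1)
for name in ldemo2:
    collected.append(name)

# Then build the Series a single time.
sdemo_all = pd.Series(collected)
print(sdemo_all)
~~~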

**NOTE**: In both cases, pandas does not enforce unique indices, which can lead to surprises. For example, let's see what happens when we create a Series with a duplicate index (with the values 4, 5, and 4 rather than the default 0, 1, 2). Observe what happens when we assign a new value:

~~~ python
sdemo = pd.Series(["Kermit","Piggy","Fozzie"],index=[4,5,4])
sdemo[4] = "Gonzo"
display(sdemo)

> 4    Gonzo
> 5    Piggy
> 4    Gonzo
~~~

Notice that in this case, since the index for both Kermit and Fozzie was '4', they were both replaced.
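
Before assigning by index, it can be worth checking whether the index has duplicates. A minimal sketch using the index's ```is_unique``` property and ```duplicated()``` method:

~~~ python
import pandas as pd

sdemo = pd.Series(["Kermit", "Piggy", "Fozzie"], index=[4, 5, 4])

index_is_unique = sdemo.index.is_unique           # False: label 4 appears twice
dup_mask = sdemo.index.duplicated(keep=False)     # True for every repeated label
repeated = sdemo[dup_mask]                        # the entries sharing a label

print(index_is_unique)
print(repeated)
~~~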

## Deleting Data from a Series

To delete data from a Series, you can either delete the data by index or create a new Series without the unwanted data. We will first delete by index. Remember here that indices are assigned to the value, not automatically assigned by position. If you have a list ```ldemo = ["Mon","Tues","Weds"]``` and drop ```"Tues"```, then ```"Weds"``` is now in the second position. This is not the case for a Series: the remaining values keep their original index labels unless you deliberately re-index the new Series.

~~~ python
sdemo = pd.Series(["Kermit","Piggy","Fozzie"])
del sdemo[1]
display(sdemo)

> 0    Kermit
> 2    Fozzie
    
sdemo.index = range(len(sdemo))
display(sdemo)

> 0    Kermit
> 1    Fozzie
~~~ 
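
The other route, creating a new Series without the unwanted data, might look like the sketch below. It shows two options: filtering by value, and ```drop()``` by index label. Both leave the original Series untouched.

~~~ python
import pandas as pd

sdemo = pd.Series(["Kermit", "Piggy", "Fozzie"])

without_piggy = sdemo[sdemo != "Piggy"]   # filter by value
dropped = sdemo.drop(1)                   # drop by index label; returns a new Series

# Either result keeps the old labels (0 and 2) until we renumber.
renumbered = dropped.reset_index(drop=True)

print(without_piggy)
print(renumbered)
~~~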

## Working with missing data in a Series 

A Series can have missing data. Typically this is signified by NaN (NumPy's "Not a Number" value, a.k.a. ```np.nan```). For example, if we create a Series with an index going 0,1,2,3,4 and no data, then each of the entries will have a NaN value.
~~~ python
sdemo = pd.Series(index=[0,1,2,3,4], dtype=object) # dtype=object so the Series can hold text
sdemo[0] = "Kermit"
sdemo[3] = "Fozzie"
display(sdemo)

> 0    Kermit
> 1       NaN
> 2       NaN
> 3    Fozzie
> 4       NaN
> dtype: object
~~~

There are three things we tend to want to do when dealing with missing data.
1. **Get rid of missing values**: use ```Series.dropna()```. Notice that this takes the argument ```inplace=True```. If you want to get rid of missing data in your Series, use this argument. If you want to create a copy without missing data and preserve the original, just omit that argument.
2. **Replace missing values**: use ```Series.fillna()```. This is for instances where we have missing data and simply want to insert some value in its place. For example, if we have a count of the number of laughs a specific muppet received in an episode, we might end up with missing data if the muppet did not get any laughs or did not appear. In which case, ```smuppet.fillna(0)``` will fill all the missing values with ```0```. 
3. **Filter in or out by missing values**. Rather than drop the missing values, we often want to slice based on them. Here we can use ```Series.isna()``` inside a slice. 

Observe all three of these below:
~~~ python 
sdemo = pd.Series(index=[0,1,2,3], dtype=object) # Create a Series with index and no values.
sdemo[0] = "Kermit"
sdemo[3] = "Fozzie"
display(sdemo)

> 0    Kermit
> 1       NaN
> 2       NaN
> 3    Fozzie

# Filling the N/A values
display(sdemo.fillna("extra"))

> 0    Kermit
> 1     extra
> 2     extra
> 3    Fozzie

# Dropping the N/A values
display(sdemo.dropna())

> 0    Kermit
> 3    Fozzie

# Slicing by the NA values
display(sdemo[sdemo.isna()])

> 1    NaN
> 2    NaN
~~~ 

## Getting unique values in a Series

Depending on the data, you might want to know whether or how many values are unique. Some examples:
1. Reading log traffic data: how many IP addresses are unique?
2. Getting a stream of tweets: how many accounts are unique?
3. Checking that an index has entirely unique values.

The ```Series.unique()``` command will return the unique values, with only one entry for each. These are returned as an "array" (a NumPy array), which is very similar to a list, rather than as a Series. To transform the array back into a Series you will have to do that explicitly.

~~~ python
ser1 = pd.Series(["Kermit","Fozzie","Kermit","Piggy","Fozzie"])
display(ser1.unique())

> array(['Kermit', 'Fozzie', 'Piggy'], dtype=object)

ser2 = pd.Series(ser1.unique()) # To transform back to a Series

~~~
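
For the "how many" and "are they all unique" questions listed above, a short sketch using ```nunique()``` and the ```is_unique``` property may be handy:

~~~ python
import pandas as pd

ser1 = pd.Series(["Kermit", "Fozzie", "Kermit", "Piggy", "Fozzie"])

n_unique = ser1.nunique()              # how many distinct values there are
values_unique = ser1.is_unique         # are all the values distinct?
index_unique = ser1.index.is_unique    # are all the index labels distinct?

print(n_unique, values_unique, index_unique)
~~~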

## Sorting a Series

A Series can be sorted by the values (```Series.sort_values()```) or by the index (```Series.sort_index()```). The sort will be ascending by default, but you can change it with the argument ```ascending=False```. This is another method that takes the ```inplace=True``` argument if you want to change the Series itself. Otherwise, it will return a new, sorted Series and leave the old one as it was.

~~~ python 
ser1 = pd.Series( {"Kermit":"Frog",
                   "Piggy":"Pig",
                   "Fozzie":"Bear",
                   "Robin":"Frog"} )

ser1.sort_values(ascending=True,inplace=True)
display(ser1)

> Fozzie    Bear
> Kermit    Frog
> Robin     Frog
> Piggy      Pig

ser2 = ser1.sort_index(ascending=False)
display(ser2)

> Robin     Frog
> Piggy      Pig
> Kermit    Frog
> Fozzie    Bear
~~~

## Changing Series Values I: Adding, Multiplying, etc...

With a Series you can change the values using the standard arithmetic operators. These treat the Series literally like a series of values and do something to each one. So for example, if you say ```Series + 1``` it will add one to each value in the Series. If the values are not all numbers (or valid missing values) it will throw an error. ```Series + "A"``` will append A to each value in the Series if they are strings and throw an error otherwise.

~~~ python
import numpy as np 
ser1 = pd.Series([1,np.nan,7])

ser1 = ser1*2
display(ser1)
> 0     2.0
> 1     NaN
> 2    14.0

ser1 = ser1-4
display(ser1)
> 0    -2.0
> 1     NaN
> 2    10.0

ser1 = ser1 + "A" #Note that the Series is full of numbers so it throws an error
> "TypeError ..."

ser2 = pd.Series(["Kermit","Piggy","Fozzie"])
ser2 = ser2 + " the Muppet"
display(ser2)

> 0    Kermit the Muppet
> 1     Piggy the Muppet
> 2    Fozzie the Muppet
~~~

## Changing Series Values II: Recoding values using Map

A really common task in social statistics is to recode values. For example, you might have a list of text values (such as "Strongly Agree", "Agree", "Disagree", etc...) that you want to turn into numbers. You might have a text entry form that you want to recode (such as "How do you identify your gender" with answers like "Man", "Female", "Cis male", "Transgendered male", "agender") which you might recode into more manageable categories. To recode these you can create a dictionary of values and then ```map``` those values on to your series. 

A scenario that I encountered in a data cleaning exercise had to do just this. We asked people to label the gender of persons behind Twitter accounts. They were all politicians, so there was no need to create a "not a person" flag. All the MPs were cisgendered, meaning they presented as the gender they were assigned at birth. But still, the coders gave six or seven different ways of writing what was essentially "Male", "Female", "Unknown". 

~~~ python
display(gender_series.unique())
> array(['Male', 'Man', 'Male (sex)', "Woman", "Female", "Female "], dtype=object)
~~~

To recode the data, we first did a ```unique()```, and then typed by hand the dictionary using those unique values. One thing that tripped us up was that someone had used ```"Female "``` with that trailing space. So here's what the resulting dictionary looks like: 

~~~ python
gender_recode_dict = {"Male":"M", 
                 "Man":"M",
                 "Male (sex)": "M",
                 "Woman":"F",
                 "Female":"F",
                 "Female ":"F"}
~~~

Then to map it on to the series the following was done: 

~~~ python
gender_recode = gender_series.map(gender_recode_dict)

gender_recode.value_counts()
> M        1046
> F        942
~~~
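
One thing worth checking after a recode like this: any value that is not a key in the dictionary comes back from ```map``` as NaN. The sketch below uses a small made-up Series and dictionary (not the original coding data) to show how to find the values that slipped through.

~~~ python
import pandas as pd

# Hypothetical responses; "Unknown" is deliberately missing from the dictionary.
responses = pd.Series(["Male", "Female ", "Woman", "Unknown"])
recode_dict = {"Male": "M", "Female ": "F", "Woman": "F"}

recoded = responses.map(recode_dict)

# Slice the original values wherever the recode produced a missing value.
unmapped = responses[recoded.isna()].unique()

print(recoded)
print(unmapped)
~~~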


## Changing Series Values III: Defining your own recode using Lambda

In the above example, ```map()``` took in a dictionary and then mapped the keys found in the  series to the values found in the dictionary. But map is a whole lot more powerful when you can define your own function for what to do with the elements in a Series. As a trivial example, we might want to take every element in a Series of characters and transform it into lower case. At this point, if you had to do this, you might think to use a ```for``` loop. That is, for each value in the series, get the index, get the value, transform the value to lower case and then reinsert it. Not only is this very fragile (what about two values with the same index?) it doesn't take advantage of some of the speed improvements that can happen in the back end. 

With ```lambda``` we do this transformation inside of a map command. Lambda is fed a value and then returns some transformation of this value. Observe the example of a lambda command that squares every value in a Series: 

~~~ python 
ser1 = pd.Series([1,3,5],index=["one","three","five"]) 
ser1 = ser1.map(lambda val: val**2)
display(ser1)
> one        1
> three      9
> five      25
~~~
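
The lower-casing task mentioned earlier follows the same pattern. In this sketch the Series deliberately has a duplicate index label, which is the situation that would trip up an index-based ```for``` loop. For string operations, pandas also provides the ```.str``` accessor, which gives the same result here.

~~~ python
import pandas as pd

# Note the duplicate index label "a".
names = pd.Series(["Kermit", "PIGGY", "Fozzie"], index=["a", "b", "a"])

lowered = names.map(lambda val: val.lower())   # apply lower() to each value
lowered_alt = names.str.lower()                # built-in string accessor

print(lowered)
~~~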

One of the most useful things about lambda is that you can embed your own functions in it. This is a fast and tidy way to do complex operations on values in a Series. For example, let's say we want to check if a blob of text has an email address in it. We first define a function to detect email addresses. We won't go into details here of how to do that (see Chapter xx for an example of how to detect emails). What's important is that we know the ```has_email(TEXT)``` function works. It takes in some TEXT and returns ```True``` if the text contains an email address and ```False``` otherwise.

~~~ python

smessages = pd.Series(["Hey, catch me at bernie.hogan@oii.ox.ac.uk", 
                  "I once emailed steve@apple.com and got a reply", 
                  "I don't really use email"])

result = smessages.map(lambda x: has_email(x)) 

display(result) 
> 0     True
> 1     True
> 2    False
~~~
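
If you want to run the example above before reaching the chapter on detecting emails, you will need some version of ```has_email()```. The one below is a hypothetical stand-in, not the function developed later in the book. It uses a deliberately loose regular expression that only looks for something shaped like ```name@domain.tld```.

~~~ python
import re
import pandas as pd

# Hypothetical stand-in: a loose pattern for "something@something.something".
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def has_email(text):
    """Return True if the text contains something shaped like an email address."""
    return EMAIL_PATTERN.search(text) is not None

smessages = pd.Series(["Hey, catch me at bernie.hogan@oii.ox.ac.uk",
                       "I once emailed steve@apple.com and got a reply",
                       "I don't really use email"])

result = smessages.map(lambda x: has_email(x))
print(result)
~~~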


## Summary: The Series 

The Series is a very powerful tool for manipulating data. Like a list, it has ordered values. Like a dictionary, the values can be accessed by key (or in this case, by 'index'). We first showed how to create a Series using a list and a dictionary. We then showed how to add values to that new Series. We can filter the Series, delete specific values and transform all the values.

These operations are extremely foundational to the act of using Python for data science tasks. In the exercise sheet, you will be given a series and asked to clean this data. Since we are just dealing with one dimension of data, we are not yet at the point where we can ask some really interesting questions, but at least we will have some of the basics down. One thing that is worth attending to in the exercises is how to chain a bunch of operations together. For example, you will not only get value_counts() on a Series, but then use that value_counts() to investigate how to transform, summarise, and explain the data.  
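
As a small taste of what chaining looks like, here is a sketch using the sleep data from earlier in the chapter. Each step takes the Series returned by the previous step and calls another method on it.

~~~ python
import pandas as pd

sleephours = pd.Series({"Sunday": 8, "Monday": 7, "Tuesday": 5, "Wednesday": 6,
                        "Thursday": 8, "Friday": 9, "Saturday": 8})

counts = sleephours.value_counts()    # hours as the index, number of nights as values
most_common_hours = counts.idxmax()   # the index label with the highest count
nights_at_mode = counts.max()         # how many nights had that many hours

# Slice, then sort, in one chained expression.
long_nights = sleephours[sleephours >= 8].sort_values(ascending=False)

print(most_common_hours, nights_at_mode)
print(long_nights)
~~~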

In the next section, we will look at DataFrames. These are tables of data with rows and columns. Each column is treated like a Series. Thus, you will discover that many of the operations that we have learned for the Series directly apply to the DataFrame. With the dataframe and some libraries we can do exceedingly powerful things with data. But first we have to learn how to get data in, get data out, and filter that data. Then in the next chapter, we will look at some file formats that can be usefully converted into DataFrames. 