Elements of Data Science

by [Allen Downey](https://allendowney.com/)

[MIT License](https://opensource.org/licenses/MIT)

## Goals

In the previous notebook, you learned about variables and two kinds of values: integers and floating-point numbers.

In this notebook, you'll see some additional types:

* Strings, which contain sequences of letters, are used to represent text data.
* Timestamp values are used to represent dates and times.
* Geographical locations can be represented and displayed in several ways.

Not every data science project uses all of these types, but many projects use at least one.

## A note

In these notebooks, I generally try to define terms when they are first used, and to explain the code examples as we go along.

However, there are a few places where I provide code with the expectation that you will not understand all of the details yet. When that happens, I'll explain what you should or should not understand.

If you are able to do the exercises, that probably means you are understanding what you need to understand.

## Strings

A **string** is a sequence of letters, numbers, and punctuation marks.

In Python you can create a string by typing letters between single or double quotation marks.

In [None]:
'Data'

In [None]:
"Science"

And you can assign string values to variables.

In [None]:
first = 'Data'

In [None]:
last = "Science"

Some arithmetic operators work with strings, but they might not do what you expect. For example, the `+` operator "concatenates" two strings; that is, it creates a new string that contains the first string followed by the second string:

In [None]:
first + last

If you want to put a space between the words, you can use a string that contains a space:

In [None]:
first + ' ' + last
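Another arithmetic operator that works with strings is `*`, which repeats a string when the other operand is an integer. Here is a minimal sketch; the variable name `repeated` is just for illustration.

```python
first = 'Data'

# Multiplying a string by an integer repeats the string
repeated = first * 3
repeated
```

The result is `'DataDataData'`: the string repeated three times, not a number.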

Strings are used to store text data like names, addresses, titles, etc.

When you read data from a file, you might see values that look like numbers, but they are actually strings, like this:

In [None]:
no_actually_a_number = '123'

If you try to do math with these strings, you might get an error:

In [None]:
no_actually_a_number + 1

Or you might get a surprising result:

In [None]:
no_actually_a_number + '1'

Fortunately, you can convert strings to numbers.

If you have a string that contains only digits, you can convert it to an integer using the `int` function:

In [None]:
int('123')

Or you can convert it to a floating-point number using `float`:

In [None]:
float('123')

But if the string contains a decimal point, you can't convert it to an `int`:

In [None]:
int('12.3')
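One possible workaround, shown here as a sketch, is to convert the string to a `float` first and then to an `int`, which drops the fractional part. The variable names below are chosen only for this example.

```python
text = '12.3'

as_float = float(text)     # the string becomes the number 12.3
as_int = int(as_float)     # converting to int drops the fractional part

# Once a string has been converted, arithmetic works as expected
total = int('123') + 1
as_int, total
```

Note that `int` truncates rather than rounds, so this approach is only appropriate when dropping the fraction is what you want.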

Going in the other direction, you can convert almost any type of value to a string using `str`:

In [None]:
str(123)

In [None]:
str(12.3)

**Exercise:** When personal names are stored in a database, they are usually stored in three variables: a given name, a family name, and sometimes a middle name. For example, a list of great rock drummers might include:

In [None]:
given = 'Neil'
middle = 'Ellwood'
family = 'Peart'

But names are often displayed in different ways in different contexts. For example, the first time you mention someone in an article, you might give all three names, like "Neil Ellwood Peart". But in the index of a book, you might put the family name first, like "Peart, Neil Ellwood".

Write Python expressions that use the variables `given`, `middle`, and `family` to display Neil Peart's name in these two formats.

In [23]:
# Solution goes here
given + ' ' + middle + ' ' + family

'Neil Ellwood Peart'

In [24]:
# Solution goes here
family + ', ' + given + ' ' + middle

'Peart, Neil Ellwood'

## Dates and times

If you read data from a file, you might also find that dates and times are represented with strings.

In [None]:
not_really_a_date = 'June 4, 1989'

To confirm that this value is a string, we can use the `type` function, which takes a value and reports its type.

In [25]:
type(not_really_a_date)

str

`str` indicates that the value of `not_really_a_date` is a string.

We get the same result with `not_really_a_time`, below:

In [26]:
not_really_a_time = '6:30:00'

In [27]:
type(not_really_a_time)

str

Representing dates and times using strings provides human-readable values, but they are not useful for doing computation.

Fortunately, Python provides libraries for working with date and time data; the one we'll use is called Pandas.

As always, we have to import a library before we use it; it is conventional to import Pandas with the abbreviated name `pd`:

In [31]:
import pandas as pd

Pandas provides a type called `Timestamp`, which represents a date and time.

It also provides a function called `Timestamp`, which we can use to convert a string to a `Timestamp`:

In [32]:
pd.Timestamp('6:30:00')

Timestamp('2021-05-20 06:30:00')

Or we can do the same thing using the variable defined above.

In [33]:
pd.Timestamp(not_really_a_time)

Timestamp('2021-05-20 06:30:00')

In this example, the string specifies a time but no date, so Pandas fills in today's date.

A `Timestamp` is a value, so you can assign it to a variable.

In [34]:
date_of_birth = pd.Timestamp('June 4, 1989')
date_of_birth

Timestamp('1989-06-04 00:00:00')

If the string specifies a date but no time, Pandas fills in midnight as the default time.

If you assign the `Timestamp` to a variable, you can use the variable name to get the year, month, and day, like this:

In [35]:
date_of_birth.year, date_of_birth.month, date_of_birth.day

(1989, 6, 4)
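The time of day can be read in the same way. The following sketch uses a hypothetical `Timestamp` called `moment` that includes both a date and a time, and reads its `hour`, `minute`, and `second` attributes.

```python
import pandas as pd

# A Timestamp that specifies both a date and a time (example value)
moment = pd.Timestamp('1989-06-04 06:30:00')

# The time components are attributes, just like year, month, and day
moment.hour, moment.minute, moment.second
```

For this example value, the result is `(6, 30, 0)`.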

You can also get the day of the week and the name of the month.

In [36]:
date_of_birth.day_name(), date_of_birth.month_name()

('Sunday', 'June')

`Timestamp` provides a function called `now` that returns the current date and time.

In [38]:
now = pd.Timestamp.now()
now

Timestamp('2021-05-20 09:59:28.089028')

**Exercise:** Use the value of `now` to display the name of the current month and day of the week.

In [40]:
# Solution goes here
now.month_name(), now.day_name()

('May', 'Thursday')

## Timedelta

`Timestamp` values support some arithmetic operations. For example, you can compute the difference between two `Timestamp` values:

In [41]:
age = now - date_of_birth
age

Timedelta('11673 days 09:59:28.089028')

The result is a `Timedelta` that represents the current age of someone born on `date_of_birth`.

The `Timedelta` contains `components` that store the number of days, hours, etc. between the two `Timestamp` values.

In [42]:
age.components

Components(days=11673, hours=9, minutes=59, seconds=28, milliseconds=89, microseconds=28, nanoseconds=0)

You can get one of the components like this:

In [44]:
age.days

11673
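The `days` attribute is a whole number, so it ignores the leftover hours, minutes, and seconds. If you want the elapsed time as a single number including the fraction, one option is to divide the `Timedelta` by another `Timedelta`. This is a sketch that uses fixed example values instead of `now`, so the result is reproducible.

```python
import pandas as pd

# Fixed example values, standing in for date_of_birth and now
start = pd.Timestamp('1989-06-04')
end = pd.Timestamp('2021-05-20 09:59:28')
elapsed = end - start

# Whole days only
whole_days = elapsed.days

# Dividing by a one-day Timedelta gives days as a floating-point number
fractional_days = elapsed / pd.Timedelta(days=1)

# total_seconds() reports the whole interval in seconds
seconds = elapsed.total_seconds()

whole_days, fractional_days, seconds
```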

The biggest component of `Timedelta` is days, not years, because days are well defined and years are problematic.

Most years are 365 days, but some are 366. The average calendar year is 365.24 days, which is a very good approximation of a solar year, [but it is not exact](https://pumas.jpl.nasa.gov/files/04_21_97_1.pdf).

One way to compute age in years is to divide age in days by 365.24:

In [45]:
age.days / 365.24

31.959807250027378

But people usually report their ages in integer years. We can use the NumPy `floor` function to round down:

In [46]:
import numpy as np

np.floor(age.days / 365.24)

31.0

Or the `ceil` function (which stands for "ceiling") to round up:

In [47]:
np.ceil(age.days / 365.24)

32.0

We can also compare `Timestamp` values to see which comes first.

For example, let's see if a person with a given birthdate has already had a birthday this year.

We can create a new `Timestamp` with the year from `now` and the month and day from `date_of_birth`.

In [48]:
bday_this_year = pd.Timestamp(now.year, date_of_birth.month, date_of_birth.day)
bday_this_year

Timestamp('2021-06-04 00:00:00')

The result represents the person's birthday this year. Now we can use the `>` operator to check whether `now` is later than the birthday:

In [49]:
now > bday_this_year

False

The result is either `True` or `False`, which are special values in Python used to represent the results of this kind of comparison.

These values belong to a type called `bool`, short for "Boolean". The name comes from Boolean algebra, which is a branch of algebra where all values are either true or false.

In [50]:
type(True)

bool

In [51]:
type(False)

bool
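A comparison like this also gives us another way to compute age in whole years, without dividing by 365.24. The following is a sketch under some assumptions: it uses a fixed value for `now` so the result is reproducible, the names `had_birthday` and `age_in_years` are made up for this example, and it does not handle someone born on February 29.

```python
import pandas as pd

date_of_birth = pd.Timestamp('1989-06-04')
now = pd.Timestamp('2021-05-20 09:59:28')   # fixed example value

# The person's birthday in the current year
bday_this_year = pd.Timestamp(now.year, date_of_birth.month, date_of_birth.day)

# True if the birthday has already happened this year
had_birthday = now >= bday_this_year

# Subtract one year if the birthday is still to come
years = now.year - date_of_birth.year
age_in_years = years if had_birthday else years - 1
age_in_years
```

With these example values the birthday has not happened yet, so the result is 31, which agrees with the value we got by rounding down.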

**Exercise:** Any two people with different birthdays have a "Double Day" when one is twice as old as the other.

Suppose you are given two `Timestamp` values, `d1` and `d2`, that represent birthdays for two people. Use `Timestamp` arithmetic to compute their double day.

Here are two example dates; with these dates, the result should be December 19, 2009.

In [52]:
d1 = pd.Timestamp('2003-07-12')

In [53]:
d2 = pd.Timestamp('2006-09-30')

In [54]:
# Solution goes here
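If you get stuck, here is one possible approach, offered as a sketch and not as the only solution. The idea is that the older person is twice as old as the younger one when the younger person's age equals the gap between the two birthdays. The names `age_gap` and `double_day` are made up for this example.

```python
import pandas as pd

d1 = pd.Timestamp('2003-07-12')
d2 = pd.Timestamp('2006-09-30')

# The difference in age between the two people, as a Timedelta
age_gap = d2 - d1

# On the double day, the younger person's age equals the age gap
double_day = d2 + age_gap
double_day
```

On that day the older person's age is twice `age_gap` and the younger person's age is exactly `age_gap`.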

## Location

There are many ways to represent geographical locations, but the most common, at least for global data, is latitude and longitude.

When stored as strings, latitude and longitude are expressed in degrees with compass directions N, S, E, and W. For example, this string represents the location of Boston, MA, USA:

In [58]:
lat_lon_string = '42.3601° N, 71.0589° W'

When we compute with location information, we use floating-point numbers, with

* Positive latitude for the northern hemisphere, negative latitude for the southern hemisphere, and

* Positive longitude for the eastern hemisphere and negative longitude for the western hemisphere.

Of course, the choice of the origin and the orientation of positive and negative are arbitrary choices that were made for historical reasons. We might not be able to change conventions like these, but we should be aware that they are conventions.

Here's how we might represent the location of Boston with two variables.

In [59]:
lat = 42.3601
lon = -71.0589

It is also possible to combine two numbers into a composite value and assign it to a single variable:

In [60]:
boston = lat, lon
boston

(42.3601, -71.0589)

The type of this variable is `tuple`, which is a mathematical term for a value that contains a sequence of elements. Math people pronounce it "tuh'ple", but computational people usually say "too'ple". Take your pick.

In [61]:
type(boston)

tuple

If you have a tuple with two elements, you can assign them to two variables, like this:

In [62]:
y, x = boston
y

42.3601

In [63]:
x

-71.0589

Notice that I assigned latitude to `y` and longitude to `x`, because a `y` coordinate usually goes up and down like latitude, and an `x` coordinate usually goes side-to-side like longitude.
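The same kind of assignment can help convert a location string, like the one for Boston above, into signed floating-point numbers. The following is a sketch that assumes the string has exactly the format shown earlier (degrees, a degree sign, a space, and a compass direction, with the two parts separated by a comma and a space). It uses the string method `split`, which has not been introduced yet, and the intermediate variable names are made up for this example.

```python
lat_lon_string = '42.3601° N, 71.0589° W'

# Split the string into a latitude part and a longitude part
lat_text, lon_text = lat_lon_string.split(', ')

# Split each part into the number and the compass direction
lat_number, lat_direction = lat_text.split('° ')
lon_number, lon_direction = lon_text.split('° ')

# North and east are positive; south and west are negative
lat = float(lat_number) if lat_direction == 'N' else -float(lat_number)
lon = float(lon_number) if lon_direction == 'E' else -float(lon_number)

boston = lat, lon
boston
```

For the Boston string, the result is the tuple `(42.3601, -71.0589)`.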

**Exercise:** Find the latitude and longitude of the place you were born or someplace you think of as your "home town". [You can use this web page to look it up](https://www.latlong.net/), among others.

Make a tuple of floating-point numbers that represents this location.

In [64]:
# Solution goes here

## Distance

If you are given two tuples that represent locations, you can compute the approximate distance between them, along the surface of the globe, using the haversine function.

If you are curious about it, [you can read an explanation in this article](https://janakiev.com/blog/gps-points-distance-python/).

To estimate a haversine distance, we have to compute the haversine function, which is defined as:
$\mathrm{haversine}(\theta) = \sin^2(\theta/2)$

where θ is an angle in radians.

We can compute this function in Python like this:

In [65]:
import numpy as np

θ = 1
np.sin(θ/2)**2

0.22984884706593015

You can use Greek letters in variable names, but there is currently no way to type them in Jupyter/Colab, so I usually copy them from a web page and paste them in.

To avoid the inconvenience, it is more common to write out letter names, like this:

In [66]:
theta = 1
np.sin(theta/2)**2

0.22984884706593015
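To show where this is heading, here is a sketch of how the haversine function can be combined into a distance estimate using the standard haversine formula. Several things here are assumptions for the sake of the example and do not come from this notebook: the coordinates for London, the Earth radius of 6371 km, and all of the intermediate variable names.

```python
import numpy as np

boston = 42.3601, -71.0589
london = 51.5074, -0.1278     # assumed coordinates, for illustration only

# Convert degrees to radians, because np.sin and np.cos expect radians
lat1, lon1 = np.radians(boston)
lat2, lon2 = np.radians(london)

# Haversine of the differences in latitude and longitude
hav_dlat = np.sin((lat2 - lat1) / 2) ** 2
hav_dlon = np.sin((lon2 - lon1) / 2) ** 2

# Combine them, then convert the resulting angle to a distance
a = hav_dlat + np.cos(lat1) * np.cos(lat2) * hav_dlon
earth_radius_km = 6371        # assumed mean radius of the Earth
distance_km = 2 * earth_radius_km * np.arcsin(np.sqrt(a))
distance_km
```

The result is an approximation, because it treats the Earth as a perfect sphere; for these two cities it should come out at a bit more than 5,000 km.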

**Exercise:** This is a good time to remind you that the operator for exponentiation is `**`.

In some other languages the operator for exponentiation is `^`. That is also an operator in Python, but it performs another operation altogether.

Try out the previous expression, replacing `**` with `^`, and see what error message you get. Remember this message in case you see it in the future!

In [67]:
# Solution goes here
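With integers, `^` does not cause an error, which makes the mistake harder to notice. In Python, `^` is the bitwise XOR operator, so it quietly produces a different number. A small sketch, with variable names chosen for this example:

```python
# Exponentiation: 2 * 2 * 2
power = 2 ** 3

# Bitwise XOR of binary 10 and binary 11, which is binary 01
xor = 2 ^ 3

power, xor
```

The first value is 8 and the second is 1, so if a calculation gives a surprising result, check whether `^` was used where `**` was intended.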

## Defining functions

