In [1]:
#Run the following code to print multiple outputs from a cell
get_ipython().ast_node_interactivity = 'all'

# Data Types

## Numbers
Previously, our examples used numeric data (although I didn't explicitly talk about it).

In [2]:
x = 5
type(x)

int

The data type `int` is for whole numbers, and `float` is for numbers with decimals:

In [3]:
x = 3
type(x)

y = 4.3
type(y)

int

float

Numeric data is data we can do math with.

In [4]:
x + y
x - y
x * y
x / y

7.3

-1.2999999999999998

12.899999999999999

0.6976744186046512

*Don't worry about the unexpected number of decimals for -1.3 and 12.9. Computers store floats in binary, and many decimal fractions (such as 4.3) have no exact binary representation, so Python stores the closest approximation. This rarely affects your results in any meaningful way.*
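
If the extra digits get in the way, two common workarounds are rounding for display and comparing with a tolerance. The following is a minimal sketch that reuses the `x = 3` and `y = 4.3` values from above; the use of `round()` and `math.isclose()` is an illustration added here, not something the earlier cells rely on.

```python
import math

x = 3
y = 4.3

diff = x - y                  # stored as -1.2999999999999998
rounded = round(diff, 2)      # round to 2 decimal places for display

# Compare floats with a tolerance instead of ==
exact_match = (diff == -1.3)
close_match = math.isclose(diff, -1.3)

print(rounded)       # -1.3
print(exact_match)   # False
print(close_match)   # True
```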

Not all data with the label "number" is a number:
* Is a phone number or social security number a number?
* ZIP codes only use numbers - but are they numbers?

**Key Question: Would you ever do math with the values?**

Why does this matter?

In [5]:
zip = 02467 #with a leading zero

SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers (3314146026.py, line 1)
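
One way around this error is to store the ZIP code as text, since we would never do math with it. The sketch below assumes a variable name, `zip_code`, chosen for illustration (it also avoids reusing `zip`, which is the name of a built-in Python function).

```python
# Store the ZIP code as text so the leading zero is preserved
zip_code = "02467"   # zip_code is a hypothetical variable name

print(zip_code)          # 02467
print(type(zip_code))    # <class 'str'>
print(len(zip_code))     # 5

# Converting the text to an int drops the leading zero
as_number = int(zip_code)
print(as_number)         # 2467
```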

## Strings

Data can also contain characters other than numbers. This "string" or "character" data must be in single or double quotes.

In [6]:
x = "test" #assigns a string to the variable x
x = 'test' #also assigns a string to the variable x
type(x)

str

But without quotes, it's not a string!

In [7]:
x = test

NameError: name 'test' is not defined

Python reads this and looks for a variable called `test`, which we never created.

You can't do math (or much analytics in general) with strings directly.

In [8]:
x = "test"
x * 2

y = "3"
y * 2   #the answer is probably not what you think

'testtest'

'33'
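
If a string holds digits and you do want to do math with it, one option is to convert it to a number first. This is a small sketch using the built-in `int()` and `float()` functions; the variable names `as_int` and `as_float` are made up for the example.

```python
y = "3"

# Convert the text to a number before doing arithmetic
as_int = int(y) * 2        # 6
as_float = float(y) * 2    # 6.0

print(as_int)
print(as_float)

# Without conversion, * repeats the string instead of multiplying
repeated = y * 2
print(repeated)            # 33
```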

Strings need a starting " or ' and a matching ending " or '. This incorrect command starts a string with the " but never finishes it.

In [9]:
x = "test

SyntaxError: unterminated string literal (detected at line 1) (2535819004.py, line 1)

## Logical (Boolean)

Logical (Boolean) values are either `True` or `False`.

In [10]:
x = True
type(x)

bool

The value can explicitly be `True` or `False`. But it can also be the result of a logical expression.

In [11]:
x = 3   # Here, we're setting the value of x to 3

In [12]:
x == 2  # Here, we're checking to see if x is equal to 2
# Note: the test for equality, ==, is not the same as the assignment, =.

False

### Logical Operations

**Comparison operators** compare two values:
* Is equal to: `==` (e.g., `x == y`)
* Not equal to: `!=` (e.g., `x != y`)
* Greater than: `>` (e.g., `x > y`)
* Less than: `<` (e.g., `x < y`)
* Greater than or equal: `>=` (e.g., `x >= y`)
* Less than or equal: `<=` (e.g., `x <= y`)

**Logical operators** combine conditional statements:
* `and`: Returns `True` if both statements are true (e.g., `x<5 and x<10`)
* `or`: Returns `True` if at least one of the statements is true (e.g., `x<5 or x<4`)
* `not`: Reverses the result; returns `False` if the result is true (e.g., `not(x<5)`)
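
The operators listed above can be tried out in a short sketch. It assumes the values `x = 3` and `y = 4.3` from earlier cells; `in_range` is a hypothetical variable name used only here.

```python
x = 3
y = 4.3

# Comparison operators return True or False
print(x == y)    # False
print(x != y)    # True
print(x <= 3)    # True

# Logical operators combine comparisons
print(x < 5 and x < 10)   # True: both comparisons are true
print(x < 5 or x < 4)     # True: at least one comparison is true
print(not (x < 5))        # False: reverses True

# The result of a logical expression can be stored like any other value
in_range = 0 < x and x < 5
print(type(in_range))     # <class 'bool'>
```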

## Dates

Data analysis frequently uses dates. And dates can be complicated.

What does 4/1/2021 mean? Is this 4 January 2021 or 1 April 2021?

The `datetime` and `dateutil` modules are good for interpreting both dates and times.

In [13]:
import datetime
x = datetime.date(2021, 4, 1)
x

datetime.date(2021, 4, 1)

We still don't know whether 4 is the month or the day. Use the `help()` function (for example, `help(datetime.date)`) to see if you can figure it out.

You can also print the variable to see a more typical date format. And you can view the `.month` or `.day` attributes:

In [14]:
print(x)

2021-04-01


In [None]:
x.month
x.day
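
Another way to remove any doubt about the order is to name the arguments when creating the date. This is a small sketch; it relies on `datetime.date` accepting the keyword arguments `year`, `month`, and `day`.

```python
import datetime

# Naming the arguments makes the year/month/day order explicit
x = datetime.date(year=2021, month=4, day=1)

print(x.year)    # 2021
print(x.month)   # 4
print(x.day)     # 1
```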

If you want to see which day of the week the date fell on, you can use the `.weekday()` method. (What's a method? It's a function that belongs to a certain type of object, and you call it *after* the variable name, as in `x.weekday()`, instead of wrapping the variable in it.)

In [None]:
x.weekday() #Monday is 0
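
For reference, here is a sketch of a few related ways to get the day of the week for the same date. The `.isoweekday()` and `.strftime()` methods are additions for illustration; the name returned by `"%A"` depends on your locale settings, so no output is shown for it.

```python
import datetime

x = datetime.date(2021, 4, 1)

day_number = x.weekday()       # Monday is 0, so Thursday is 3
iso_number = x.isoweekday()    # Monday is 1, so Thursday is 4
day_name = x.strftime("%A")    # full weekday name, locale-dependent

print(day_number)   # 3
print(iso_number)   # 4
print(day_name)
```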

### Parsing date information

We can use functions to create dates from strings:

In [None]:
from dateutil import parser
x = parser.parse("4/1/2021")
x.month

Notice that, by default, `.parse()` reads the month before the day. If you want to switch this, you can add the `dayfirst=True` parameter:

In [None]:
x = parser.parse("4/1/2021", dayfirst = True)
x.month
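
If you already know how a date string is laid out, an alternative is to spell out the format yourself. The sketch below swaps in `datetime.datetime.strptime` from the standard library instead of `dateutil`; the variable names `month_first` and `day_first` are made up for the example.

```python
import datetime

text = "4/1/2021"

# %m = month, %d = day, %Y = four-digit year
month_first = datetime.datetime.strptime(text, "%m/%d/%Y")
day_first = datetime.datetime.strptime(text, "%d/%m/%Y")

print(month_first.date())   # 2021-04-01
print(day_first.date())     # 2021-01-04
```

Because the format is stated explicitly, the same string can never be silently read the wrong way around.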

### Practice with dates: For February 20, 1991

* Create a variable called `python` with the date: February 20, 1991
     - FYI, this is the date on which Python was first released
* Extract the year
* Extract the month
* Find the day of the week
* Find the number of days between February 20, 1991 and December 31, 1999
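
As a hint for the last item, subtracting one date from another gives a `timedelta` object with a `.days` attribute. The sketch below deliberately uses different dates from the exercise; `start`, `end`, and `gap` are hypothetical names.

```python
import datetime

start = datetime.date(2021, 1, 1)
end = datetime.date(2021, 4, 1)

gap = end - start    # a datetime.timedelta object
print(gap.days)      # 90
```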

# Key analytics module: `pandas`

Analytics requires data. How do we get data into Python? Python will import data from many, many sources:
* Databases
* Websites
* Files (of many formats)

`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool. 

![image.png](attachment:375aed04-6bd1-45c8-89aa-d93671337f22.png)

`pandas` integrates with many file formats or data sources (csv, excel, sql, json, parquet,…).

![image.png](attachment:e9c32afb-2a91-4322-b036-a1247edc72dc.png)

Functions with the prefix `read_*` import data. Similarly, the `to_*` functions store data.
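
To see the `read_*` / `to_*` pairing in action without needing a file, here is a rough sketch that writes a tiny table to CSV text and reads it back. The table contents and the names `small`, `csv_text`, and `round_trip` are made up for illustration.

```python
import io
import pandas

# A tiny, made-up table used only for illustration
small = pandas.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# to_csv with no file path returns the CSV text as a string
csv_text = small.to_csv(index=False)
print(csv_text)

# read_csv accepts a file path or a file-like object such as StringIO
round_trip = pandas.read_csv(io.StringIO(csv_text))
print(round_trip)
```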

## Importing data

Today, we'll be working with the file, "02_Sample.csv", which is in the DA3 folder on Jupyter Hub. Run the next code block to import pandas and read in the file:

In [None]:
import pandas
df = pandas.read_csv("02_Sample.csv")

*If you got an error message, check that you've spelled the file name correctly and remember it is case sensitive!*
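
If the file still can't be found, it can help to check which folder Python is looking in and which files are there. This is one possible check using the standard library's `pathlib`; it assumes the file is expected in the notebook's current working folder.

```python
from pathlib import Path

path = Path("02_Sample.csv")

print(Path.cwd())       # the folder Python is currently looking in
print(path.exists())    # True only if the file is in that folder

# List the .csv files in that folder to spot spelling or case differences
csv_files = sorted(p.name for p in Path.cwd().glob("*.csv"))
print(csv_files)
```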

## Exploring the Data

Notice that we saved the DataFrame to a variable...this allows us to work with it. And when you print the variable name, it gives you the first 5 and last 5 rows of data, along with information about how many rows and columns there are.

In [None]:
df

I find that printing the variable name is the quickest way to get basic information about the table. You can also view the `.shape` attribute or use the `.head()` and `.tail()` methods to get this information.

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.tail()

You can also view the `.dtypes` attribute to see the data types for each column:

In [None]:
df.dtypes

The `.describe()` method provides summary statistics for the numeric columns:

In [None]:
df.describe()
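
If you don't have the sample file handy, the same attributes and methods can be tried on a small stand-in table. The sketch below uses a made-up DataFrame called `demo`; its column names and values are invented and have nothing to do with the contents of "02_Sample.csv".

```python
import pandas

# Hypothetical stand-in table; the columns and values are invented
demo = pandas.DataFrame({
    "item": ["a", "b", "c"],
    "quantity": [1, 2, 3],
    "price": [0.5, 1.5, 2.5],
})

print(demo.shape)     # (3, 3): 3 rows and 3 columns
print(demo.head(2))   # first 2 rows
print(demo.dtypes)    # one data type per column

# describe() summarizes only the numeric columns by default
summary = demo.describe()
print(list(summary.columns))          # ['quantity', 'price']
print(summary.loc["mean", "price"])   # 1.5
```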

We'll talk about ways to profile non-numeric data next class.

## Practice

Now, use the `read_excel()` function from `pandas` to read in the file "02_Sample.xlsx" and explore the data to see if there are any differences between the .csv and .xlsx versions of the sample.

*Hint: when reading in the new file, I'd recommend assigning it to a different variable name so that you don't write over the `df` variable we created above.*

## Coda: File Formats

We've only scratched the surface, but these are the most common file formats. Python also imports from many other files:
* Other statistics packages such as SPSS, SAS, etc.
* Other delimiters besides commas, such as tab, etc.
* Complex structures such as JSON, XML, etc.

And if you have a file with a particular format not mentioned, you're likely not the first person to need it...search online!