# Pandas - Creating a Simple Dataframe

Let's import pandas and numpy.

In [1]:
import pandas as pd
import numpy as np

I create two variables, x and y, where x is a NumPy array of the numbers from 1 to 10, and y holds each number in x multiplied by 2.

In [2]:
x = np.arange(1, 11)
y = x * 2

Notice that when I print out x and y, they print as NumPy arrays, laid out horizontally. Below we will see that if I create x and y as pandas series instead, they print vertically, which makes the values easier to read.

I find this useful because I can picture a dataframe as just a bunch of series placed side by side.

Another way to see it is as layers: NumPy is built on top of Python, and pandas is built on top of NumPy. For numeric data like this, a pandas series keeps its values in a NumPy array, which is a more efficient container for numbers than a plain Python list.

In [3]:
x

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [4]:
y

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])
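The layering described above (Python list, NumPy array, pandas series) can be checked with a small sketch. The names `py_list`, `arr`, `ser` and `values` are my own for illustration, and the data is made up.

```python
import numpy as np
import pandas as pd

py_list = [1, 2, 3]         # plain Python list (illustrative data)
arr = np.array(py_list)     # NumPy array built from the list
ser = pd.Series(arr)        # pandas series built from the array

# to_numpy() hands the values of the series back as a NumPy array
values = ser.to_numpy()

for obj in (py_list, arr, ser, values):
    print(type(obj).__name__)
```

The four printed type names should read list, ndarray, Series, ndarray, which is the same chain in both directions.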

A dataframe is essentially a set of equal-length lists with keys, so a dictionary is the most natural way I can think of to create a simple one. The keys become the column names: here the keys x and y are displayed as column names.

In [5]:
df = pd.DataFrame({'x':x, 'y':y})
df

|   | x  | y  |
|---|----|----|
| 0 | 1  | 2  |
| 1 | 2  | 4  |
| 2 | 3  | 6  |
| 3 | 4  | 8  |
| 4 | 5  | 10 |
| 5 | 6  | 12 |
| 6 | 7  | 14 |
| 7 | 8  | 16 |
| 8 | 9  | 18 |
| 9 | 10 | 20 |
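The same dictionary idea works with plain Python lists too, not only NumPy arrays. This is a minimal sketch with made-up data. `data` and `df_small` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data: plain Python lists instead of NumPy arrays
data = {'x': [1, 2, 3], 'y': [2, 4, 6]}
df_small = pd.DataFrame(data)

print(list(df_small.columns))   # the dictionary keys become column names
print(df_small['y'].tolist())   # each column keeps its own list of values
print(df_small.shape)           # (number of rows, number of columns)
```

All the lists need to have the same length, otherwise pandas raises an error when building the dataframe.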


Below I create the same simple dataframe as above, this time using pandas series.

Notice how x1 and y1 are displayed: as column vectors. This idea is quite useful in machine learning, where I need to vectorise datasets.

Vectorisation helps speed up performance. Instead of running around in for loops, it applies the computation to the whole array in one step.

Disclaimer: "one step" is a simplification. As I understand it, the looping still happens, but inside NumPy's compiled code rather than in the Python interpreter, and that is where most of the speed-up comes from.
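To get a rough feel for the speed difference, here is a small timing sketch. It is only an illustration, not a proper benchmark: the array size, the repeat count and the helper names `double_with_loop` and `double_vectorised` are my own choices, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

data = np.arange(1, 100_001)    # illustrative size, chosen arbitrarily

def double_with_loop(values):
    # Visit every element one at a time in Python
    out = []
    for v in values:
        out.append(v * 2)
    return out

def double_vectorised(values):
    # One expression over the whole array
    return values * 2

loop_time = timeit.timeit(lambda: double_with_loop(data), number=5)
vec_time = timeit.timeit(lambda: double_vectorised(data), number=5)
print(f"loop: {loop_time:.4f}s  vectorised: {vec_time:.4f}s")
```

Both functions return the same numbers. Only the way the work gets done differs.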

In [6]:
x1 = pd.Series(np.arange(1, 11))
x1

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int32

The next cell is an example of vectorisation. Imagine doing this with a for loop: I would have to write a few lines of code to loop over every element of x1.

In [7]:
y1 = x1 * 2
y1

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int32
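For comparison, this is roughly what the for loop version of the cell above could look like. It is a sketch of one possible way to write it, and `doubled` and `y1_loop` are names I made up for it.

```python
import numpy as np
import pandas as pd

x1 = pd.Series(np.arange(1, 11))

# Loop version: visit each position and double the value by hand
doubled = []
for i in range(len(x1)):
    doubled.append(x1.iloc[i] * 2)
y1_loop = pd.Series(doubled)

# Vectorised version: one expression over the whole series
y1 = x1 * 2

# Element-by-element comparison of the two results
print((y1_loop == y1).all())
```

The loop takes four lines to do what `x1 * 2` does in one, and the comparison at the end confirms that both give the same values.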

Here the dataframe is displayed. Keep in mind that x1 and y1 are series, so a dataframe can be built from series. That's the take-home message.

In [8]:
df1 = pd.DataFrame({'x1':x1, 'y1':y1})
df1

|   | x1 | y1 |
|---|----|----|
| 0 | 1  | 2  |
| 1 | 2  | 4  |
| 2 | 3  | 6  |
| 3 | 4  | 8  |
| 4 | 5  | 10 |
| 5 | 6  | 12 |
| 6 | 7  | 14 |
| 7 | 8  | 16 |
| 8 | 9  | 18 |
| 9 | 10 | 20 |
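The take-home message also works in reverse: a column pulled out of the dataframe is a series again. The sketch below checks that, and also shows `pd.concat` as another way to line series up as columns. `df1_alt` is a name I made up, and using `pd.concat` here is my own addition rather than something the cells above rely on.

```python
import numpy as np
import pandas as pd

x1 = pd.Series(np.arange(1, 11))
y1 = x1 * 2
df1 = pd.DataFrame({'x1': x1, 'y1': y1})

# A single column taken back out of the dataframe is a Series
print(type(df1['y1']).__name__)

# The dataframe shares the index of the series it was built from
print(df1.index.equals(x1.index))

# Alternative: name each series, then place them side by side as columns
df1_alt = pd.concat([x1.rename('x1'), y1.rename('y1')], axis=1)
print(list(df1_alt.columns))
```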
