# Creating DataFrames in Python

This tutorial shows several ways to create a Pandas DataFrame. We start by importing the libraries we need:

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

A Series is a one-dimensional ndarray with axis labels, collectively called the index. When the object prints to the console, the first column shows the index labels (like the row numbers in an Excel spreadsheet) and the second column shows the values. To create a Series object:

In [31]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
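The index does not have to be the default 0, 1, 2, ... sequence. As a minimal sketch, the same values can be given explicit labels; the labels 'a' to 'f' and the variable names here are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Same values as above, but with explicit string labels instead of 0..5.
# The labels 'a'..'f' are illustrative assumptions, not part of the tutorial data.
s_labelled = pd.Series([1, 3, 5, np.nan, 6, 8], index=list('abcdef'))

# Look up a value by its label rather than by its position.
value_at_c = s_labelled['c']
print(value_at_c)

# The NaN entry means the values are stored as floats, not integers.
print(s_labelled.dtype)
```

Label-based lookup is what distinguishes a Series from a plain NumPy array, where only positional indexing is available.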

Operations between Series align values based on their index labels. When both Series share the same index, as they do here, the result is an element-wise sum, much like adding two vectors in MATLAB:

In [32]:
s1 = pd.Series([1,2,3])
s2 = pd.Series([4,5,6])

s3 = s1 + s2
s3

0    5
1    7
2    9
dtype: int64
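Index alignment only becomes visible when the two indexes differ. The following sketch, which uses hypothetical labels and variable names, adds two Series whose labels overlap only partly. Labels present in just one operand should produce NaN in the result:

```python
import pandas as pd

# Two Series whose labels only partly overlap ('b' and 'c' are shared).
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not by position.
# Labels found in only one Series ('a' and 'd') have no partner, so they become NaN.
total = left + right
print(total)
```

Because NaN cannot be stored in an integer column, the result is converted to float64 even though both inputs hold integers.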

Next, we create a Pandas DataFrame from a two-dimensional NumPy array, naming the columns explicitly:

In [33]:
df1 = pd.DataFrame(data=np.array([[1,2,3],[4,5,6]], dtype=int),columns=['A','B','C'])
df1

   A  B  C
0  1  2  3
1  4  5  6
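To check what was built, a few attributes of the DataFrame can be inspected. This is a small sketch that rebuilds `df1` so it runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild df1 exactly as in the cell above.
df1 = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6]], dtype=int),
                   columns=['A', 'B', 'C'])

# shape is a (number of rows, number of columns) tuple.
print(df1.shape)

# Column labels come from the columns argument.
print(list(df1.columns))

# No index was passed, so pandas assigns the default 0..n-1 row labels.
print(list(df1.index))
```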


We can also create a DataFrame by passing a NumPy array together with a datetime index and labelled columns. First we create a Pandas DatetimeIndex:

In [34]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Using this DatetimeIndex as the row index, we now create a Pandas DataFrame filled with random values drawn from a standard normal distribution:

In [35]:
df2 = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df2

                   A         B         C         D
2013-01-01 -1.059075 -0.353507 -0.113532  0.823414
2013-01-02 -0.938028  1.451641  1.723332 -2.083195
2013-01-03 -0.883789 -1.329716 -0.532520  0.805090
2013-01-04 -0.470843 -0.435266  0.734664 -0.375342
2013-01-05  0.029978 -1.225895  0.462518  0.418967
2013-01-06  2.788081  0.184233  0.210801 -0.701399
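Because the values are random, the numbers above will differ on every run. One way to make such a frame reproducible, sketched here under the assumption that NumPy's `default_rng` generator is an acceptable substitute for `np.random.randn`, is to seed the generator. The seed value 0 and the names `df_seeded`, `row` and `window` are arbitrary. The sketch also shows rows being selected by date:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# A seeded generator yields the same "random" numbers on every run.
# The seed value is an arbitrary choice for illustration.
rng = np.random.default_rng(seed=0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)),
                         index=dates, columns=list('ABCD'))

# Select a single row by its date label; the result is a Series of 4 values.
row = df_seeded.loc[pd.Timestamp('2013-01-03')]

# Slice a range of dates; with labels, both endpoints are included.
window = df_seeded.loc['2013-01-02':'2013-01-04']
print(window)
```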


We can also create Pandas DataFrames using dictionaries. Each key becomes a column name, and scalar values are repeated to fill every row:

In [36]:
df3 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4, dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df3

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
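Unlike a NumPy array, each column of a DataFrame can hold a different type. As a sketch, rebuilding `df3` and printing its `dtypes` attribute shows one type per column. The exact names printed for the datetime and string columns may vary between pandas versions:

```python
import numpy as np
import pandas as pd

# Rebuild df3 exactly as in the cell above.
df3 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# One dtype per column; the scalars in A, B and F were repeated for all 4 rows.
print(df3.dtypes)
```

The number of rows is taken from the array-like values (columns C, D and E), which must all have the same length.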


We can also create a Pandas DataFrame by reading a .csv file:

In [37]:
# Build the path to the file starting from the current working directory.
# os.path.join inserts the correct separator for the operating system.
filename = os.path.join(os.getcwd(), 'Data', 'powergen.csv')
# Read the .csv file into a Pandas DataFrame
df4 = pd.read_csv(filename)
df4.head()

      AT      V       AP     RH      PE
0  14.96  41.76  1024.07  73.17  463.26
1  25.18  62.96  1020.04  59.08  444.37
2   5.11  39.40  1012.16  92.14  488.56
3  20.86  57.32  1010.24  76.64  446.48
4  10.82  37.50  1009.23  96.62  473.90
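`pd.read_csv` also accepts any file-like object, which makes it possible to try it without the data file. The sketch below assumes the file is comma-separated with a header row. It feeds the five rows shown above through `io.StringIO`; the names `csv_text`, `df_demo` and `df_subset` are hypothetical:

```python
import io

import pandas as pd

# The first five rows shown above, written out as CSV text.
csv_text = """AT,V,AP,RH,PE
14.96,41.76,1024.07,73.17,463.26
25.18,62.96,1020.04,59.08,444.37
5.11,39.4,1012.16,92.14,488.56
20.86,57.32,1010.24,76.64,446.48
10.82,37.5,1009.23,96.62,473.9
"""

# StringIO wraps the text so that read_csv can treat it like an open file.
df_demo = pd.read_csv(io.StringIO(csv_text))

# usecols restricts the read to the named columns.
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=['AT', 'PE'])
print(df_subset)
```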
