# Getting data into a DataFrame

In real-world data analysis problems, data resides in sources such as text files, Excel files, proprietary-format data files, and databases. It must be imported from these sources into a DataFrame. pandas provides a family of `read_` functions for importing data.

First, we discuss how to import data from text files.

## Importing data from text files

pandas provides several functions for reading *tabular data* into a DataFrame object. The `read_csv` function described below is a workhorse function for this purpose.

### `read_csv` function

This function is designed to read tabular data stored in a **c**omma-**s**eparated **v**alues (`.csv`) file.

In [1]:
import pandas as pd
df = pd.read_csv("data1.txt")
df

Unnamed: 0,Age,Height,Weight
0,18,155,50
1,21,157,60
2,15,150,45
3,20,160,58
4,24,162,64


Note that the column labels are automatically taken from the first row, whereas the row labels are automatically set to the positional indices.
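
To experiment without a file on disk, the same behaviour can be reproduced from an in-memory string. The sketch below assumes that `data1.txt` holds the comma-separated text stored in `csv_text`; this is reconstructed from the output above, so the exact file contents are an assumption. `io.StringIO` stands in for the file, and `df_demo` is a name introduced only for this illustration.

```python
import io
import pandas as pd

# Assumed contents of data1.txt, reconstructed from the displayed output.
csv_text = """Age,Height,Weight
18,155,50
21,157,60
15,150,45
20,160,58
24,162,64
"""

# read_csv accepts any file-like object, so StringIO can replace a file name.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(list(df_demo.columns))  # column labels come from the first row
print(list(df_demo.index))    # row labels are the positions 0, 1, 2, ...
```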

In [2]:
df.dtypes

Age       int64
Height    int64
Weight    int64
dtype: object

In [3]:
df.Height

0    155
1    157
2    150
3    160
4    162
Name: Height, dtype: int64

When the data contains row labels, they are recognized automatically because *the first line contains one value fewer than the remaining lines*.

In [4]:
df2 = pd.read_csv("data2.txt", skipinitialspace=True)
df2

Unnamed: 0,Age,Height,Weight
p1,18,155,50
p2,21,157,60
p3,15,150,45
p4,20,160,58
p5,24,162,64


Note the use of the `skipinitialspace` parameter, which tells the parser to skip spaces that follow a delimiter.
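
The effect of `skipinitialspace` is easiest to see on the column names. The following is a minimal sketch using made-up file contents (the actual `data2.txt` is not shown, so `csv_text` is an assumption): every comma is followed by a space, and the header line has one field fewer than the data lines.

```python
import io
import pandas as pd

# Hypothetical file contents: a space follows every comma, and the header
# line has one field fewer than the data lines.
csv_text = """Age, Height, Weight
p1, 18, 155, 50
p2, 21, 157, 60
"""

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(list(raw.columns))    # names keep their leading space
print(list(clean.columns))  # leading spaces are skipped
print(list(clean.index))    # first field of each data line is the row label
```

Without the parameter, a column would have to be accessed as `raw[' Height']`, and attribute access such as `raw.Height` would fail.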

In [5]:
df2.columns

Index(['Age', 'Height', 'Weight'], dtype='object')

In [6]:
df2.Height

p1    155
p2    157
p3    150
p4    160
p5    162
Name: Height, dtype: int64

The dtypes (data types) of the columns are inferred automatically when the data is imported, as the next example shows.

In [7]:
df3 = pd.read_csv("data3.txt", skipinitialspace=True)
df3

Unnamed: 0,Age,Height,Weight
p1,18,155,50.5
p2,21,157,60.3
p3,15,150,45.0
p4,20,160,58.8
p5,24,162,64.1


In [8]:
df3.dtypes

Age         int64
Height      int64
Weight    float64
dtype: object

Sometimes a data file contains comment lines. These lines can be skipped by specifying the `comment` argument.

In [9]:
df4 = pd.read_csv("data4.txt", comment = "#", skipinitialspace=True)
df4

Unnamed: 0,Age,Height,Weight
p1,18,155,50.5
p2,21,157,60.3
p3,15,150,45.0
p4,20,160,58.8
p5,24,162,64.1
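
Since the contents of `data4.txt` are not shown, here is a small sketch with made-up text that illustrates the `comment` argument. The string `csv_text` and the name `df_demo` are assumptions introduced for this example.

```python
import io
import pandas as pd

# Hypothetical file contents with two full-line comments.
csv_text = """# Measurements recorded in a classroom survey
Age,Height,Weight
18,155,50.5
# the next record was entered later
21,157,60.3
"""

# Lines starting with the comment character are skipped entirely.
df_demo = pd.read_csv(io.StringIO(csv_text), comment="#")
print(df_demo)
```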


In some data files, the initial lines are documentation and may not follow any well-defined pattern. In such cases we can use the `header` argument to specify the number of the line that contains the header (that is, the line that contains the column names). All lines before the header line are skipped automatically.

In [10]:
df5 = pd.read_csv("d:/data/data5.txt", header=3, skipinitialspace=True)  # header line is the line number 3
df5

Unnamed: 0,Age,Height,Weight
p1,19,155,50.5
p2,21,157,60.3
p3,15,150,45.0
p4,20,160,58.8
p5,24,162,64.1


Note that:
1. Line numbers start at 0.
2. Blank lines are not included in the line count.
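
The two counting rules can be checked on a small in-memory example. The text below is made up for illustration (it is not the content of `data5.txt`): two lines of free text, one blank line, and then the table.

```python
import io
import pandas as pd

# Hypothetical file contents: two lines of free text, a blank line, then the table.
text = """Student survey
Collected by the department

Age,Height,Weight
18,155,50.5
21,157,60.3
"""

# Counting from 0 and ignoring the blank line, the header is line 2.
df_demo = pd.read_csv(io.StringIO(text), header=2)
print(df_demo)
```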

**Exercise**

`read_csv` is a very powerful function that supports a large variety of options. Read the help on the `read_csv` function and experiment with reading data stored in various forms in text files.

## Importing data from Excel file

### `read_excel` function

This function is designed to read data from various versions of the Excel file format, as well as other spreadsheet-like file formats.

In [11]:
fileName = "Yields.xlsx"
yields = pd.read_excel(fileName)
yields

Unnamed: 0,Block,Treatment,Yield
0,1,A,25.12
1,1,B,21.16
2,1,C,32.45
3,1,D,18.25
4,2,A,24.35
5,2,B,22.15
6,2,C,30.75
7,2,D,20.45
8,3,A,26.85
9,3,B,22.05


In [12]:
yields.dtypes

Block          int64
Treatment     object
Yield        float64
dtype: object

The old Microsoft Excel file format is recognized automatically from the extension `.xls`.

It is also possible to specify the file format explicitly using the `engine` argument.

In [13]:
fileName = "Yields.xls"
yield2 = pd.read_excel(fileName)
yield2

Unnamed: 0,Block,Treatment,Yield
0,1,A,25.12
1,1,B,21.16
2,1,C,32.45
3,1,D,18.25
4,2,A,24.35
5,2,B,22.15
6,2,C,30.75
7,2,D,20.45
8,3,A,26.85
9,3,B,22.05


In [14]:
pd.read_excel?

It is possible to specify the name of the sheet of the Excel file from which we want to import the data. For this, use the `sheet_name` argument.

In [15]:
pd.read_excel("Yields2.xlsx", sheet_name = "Data")

Unnamed: 0,Block,Treatment,Yield
0,1,A,25.12
1,1,B,21.16
2,1,C,32.45
3,1,D,18.25
4,2,A,24.35
5,2,B,22.15
6,2,C,30.75
7,2,D,20.45
8,3,A,26.85
9,3,B,22.05


**Exercise**

`read_excel` is also a powerful function that supports a large variety of options. Read the help on the `read_excel` function and experiment with reading data stored in various forms in Excel files.

## Computing with DataFrame

## Viewing a data frame

To see the first five rows, we can use the `head` method.

In [16]:
yields.head()

Unnamed: 0,Block,Treatment,Yield
0,1,A,25.12
1,1,B,21.16
2,1,C,32.45
3,1,D,18.25
4,2,A,24.35


In [17]:
yields.head(3)

Unnamed: 0,Block,Treatment,Yield
0,1,A,25.12
1,1,B,21.16
2,1,C,32.45


Similarly, the `tail` method can be used to see the last five rows.
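
As a quick illustration of `tail`, the sketch below uses a small made-up table (`demo` is not one of the datasets above). Like `head`, `tail` accepts the number of rows to return.

```python
import pandas as pd

# Small made-up table, used only to illustrate tail.
demo = pd.DataFrame({"Block": [1, 1, 2, 2, 3, 3, 4],
                     "Yield": [25.1, 21.2, 24.4, 22.2, 26.9, 22.1, 23.0]})

last_five = demo.tail()   # last five rows (the default)
last_two = demo.tail(2)   # last two rows

print(last_five)
print(last_two)
```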

## Summary Statistics

The `describe` method returns a DataFrame with the summary statistics contained in rows.

In [18]:
yields.describe()

Unnamed: 0,Block,Yield
count,12.0,12.0
mean,2.0,24.248333
std,0.852803,4.795689
min,1.0,17.95
25%,1.0,20.9825
50%,2.0,23.25
75%,3.0,27.5
max,3.0,32.45


In [19]:
yields.dtypes

Block          int64
Treatment     object
Yield        float64
dtype: object

The `describe` method returns descriptive statistics for all numeric columns of the DataFrame. Here, since the dtype of `Block` was inferred as numeric (`int64`), it is included in the output.

To prevent such unwanted dtype inference, we can provide the dtype explicitly while importing the data, as shown below.

In [20]:
yields = pd.read_excel("Yields.xlsx", dtype = {"Block":object})
yields

Unnamed: 0,Block,Treatment,Yield
0,1,A,25.12
1,1,B,21.16
2,1,C,32.45
3,1,D,18.25
4,2,A,24.35
5,2,B,22.15
6,2,C,30.75
7,2,D,20.45
8,3,A,26.85
9,3,B,22.05


In [21]:
yields.dtypes

Block         object
Treatment     object
Yield        float64
dtype: object

In [22]:
yields.describe()

Unnamed: 0,Yield
count,12.0
mean,24.248333
std,4.795689
min,17.95
25%,20.9825
50%,23.25
75%,27.5
max,32.45


To obtain descriptive statistics for non-numeric dtypes, we can use the `include` argument as shown below.

In [23]:
yields.describe(include=object)

Unnamed: 0,Block,Treatment
count,12,12
unique,3,4
top,1,A
freq,4,3


## Descriptive Statistics 

The `describe` method returns a DataFrame containing a predefined set of descriptive statistics, with some flexibility in choosing which percentiles are reported.
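
The percentiles are chosen through the `percentiles` argument of `describe`. A minimal sketch on made-up values (`demo` is hypothetical data, not the Yields table):

```python
import pandas as pd

# Made-up yields, used only to illustrate the percentiles argument.
demo = pd.DataFrame({"Yield": [25.12, 21.16, 32.45, 18.25, 24.35, 22.15]})

# Request the 10th and 90th percentiles instead of the default quartiles.
summary = demo.describe(percentiles=[0.1, 0.9])
print(summary)
```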

Series and DataFrame objects, however, have several methods for computing descriptive statistics. For example, the mean and variance can be computed as

In [24]:
df.mean()

Age        19.6
Height    156.8
Weight     55.4
dtype: float64

In [25]:
df.var()

Age       11.3
Height    21.7
Weight    59.8
dtype: float64

Note that these methods return a Series object. 

**Exercise**: 
1. Explore the arguments that can be passed to these methods.
2. There are several other methods available for computing different descriptive statistics. Explore [these methods](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats).

A DataFrame containing descriptive statistics can be easily constructed as shown below. 

In [26]:
Means = df.mean(); Means.name = 'Mean'
Stds = df.std(); Stds.name = 'Std Dev'
Skewness = df.skew(); Skewness.name = 'Coef Skewness'
pd.DataFrame([Means, Stds, Skewness])

Unnamed: 0,Age,Height,Weight
Mean,19.6,156.8,55.4
Std Dev,3.361547,4.658326,7.733046
Coef Skewness,-0.147425,-0.605427,-0.478769


Note that the name attributes have been used as index values.

## Data Aggregation

By aggregation operation, we mean an operation that transforms an array into a scalar. 
The methods for computing descriptive statistics such as mean, median, count, sum, min, max, etc. are examples of aggregation methods.

### Using `agg` method

The DataFrame containing descriptive statistics that we computed earlier can be more conveniently computed as

In [27]:
df.agg(['mean', 'std', 'skew'])

Unnamed: 0,Age,Height,Weight
mean,19.6,156.8,55.4
std,3.361547,4.658326,7.733046
skew,-0.147425,-0.605427,-0.478769


Desired index values can be assigned to produce custom output.

In [28]:
descriptives = df.agg(['mean', 'std', 'skew'])
descriptives.index = ['Mean', 'Std Dev', 'Skewness']
descriptives

Unnamed: 0,Age,Height,Weight
Mean,19.6,156.8,55.4
Std Dev,3.361547,4.658326,7.733046
Skewness,-0.147425,-0.605427,-0.478769


### User-defined aggregation function

We can also use a user-defined aggregation function with the `agg` method, as shown below.

In [29]:
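# Note: this name shadows Python's built-in `range`; it is kept here because
# `agg` uses the function name as the row label in the output.
def range(x):
    return x.max() - x.min()
df.agg(['median', range])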

Unnamed: 0,Age,Height,Weight
median,20.0,157.0,58.0
range,9.0,12.0,19.0


Note that `agg` is an alias of `aggregate`. Use of the alias is more common.
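
Because the row label is taken from the function name, the function in the example above is called `range`, which hides Python's built-in `range` for the rest of the session. A possible alternative, sketched below, is to give the function a different name; `value_range` and `df_demo` are names chosen for this illustration, and `df_demo` repeats the values of `df`.

```python
import pandas as pd

# Same values as the df used above.
df_demo = pd.DataFrame({"Age": [18, 21, 15, 20, 24],
                        "Height": [155, 157, 150, 160, 162],
                        "Weight": [50, 60, 45, 58, 64]})

def value_range(x):
    """Difference between the largest and smallest value of a column."""
    return x.max() - x.min()

# The function name becomes the row label of the result.
result = df_demo.agg(["median", value_range])
print(result)
```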

### Centering data

A DataFrame supports NumPy-style elementwise arithmetic, so many operations that can be performed on an ndarray can also be performed on a DataFrame. For example, if `df` is regarded as a data matrix, the data can be centered as follows.

In [30]:
df

Unnamed: 0,Age,Height,Weight
0,18,155,50
1,21,157,60
2,15,150,45
3,20,160,58
4,24,162,64


In [31]:
df.mean()

Age        19.6
Height    156.8
Weight     55.4
dtype: float64

In [32]:
df - df.mean()

Unnamed: 0,Age,Height,Weight
0,-1.6,-1.8,-5.4
1,1.4,0.2,4.6
2,-4.6,-6.8,-10.4
3,0.4,3.2,2.6
4,4.4,5.2,8.6


### Variance-covariance matrix

The variance-covariance matrix can be computed as

In [33]:
dfCentered = df-df.mean()
n = df.index.size
dfCentered.T.dot(dfCentered)/(n-1)

Unnamed: 0,Age,Height,Weight
Age,11.3,14.65,25.45
Height,14.65,21.7,33.6
Weight,25.45,33.6,59.8


Note the index values for rows and columns in the result.

This matrix can, however, be readily computed using the `cov` method.

In [34]:
df.cov()

Unnamed: 0,Age,Height,Weight
Age,11.3,14.65,25.45
Height,14.65,21.7,33.6
Weight,25.45,33.6,59.8


## Sample Datasets

For learning and practicing data analysis using Python, it is necessary to have sample datasets.

### The `pydataset` package

The `pydataset` package provides many datasets in pandas `DataFrame` format. These datasets are mostly taken from [RDatasets](https://github.com/vincentarelbundock/Rdatasets).

To install the `pydataset` package, issue the command

    pip install pydataset

A dataset from this package can be loaded as shown below.

In [35]:
from pydataset import data
iris = data('iris')
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa


The list of all available datasets can be seen as follows.

In [36]:
data()

Unnamed: 0,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
1,BJsales,Sales Data with Leading Indicator
2,BOD,Biochemical Oxygen Demand
3,Formaldehyde,Determination of Formaldehyde
4,HairEyeColor,Hair and Eye Color of Statistics Students
...,...,...
752,VerbAgg,Verbal Aggression item responses
753,cake,Breakage Angle of Chocolate Cakes
754,cbpp,Contagious bovine pleuropneumonia
755,grouseticks,Data on red grouse ticks from Elston et al. 2001


### Computing a frequency table

One of the basic tasks in summarizing data is to present frequency tables. A frequency table can be computed using the `crosstab` function.

#### `crosstab` function

In [37]:
titanic = data('titanic')
titanic.head()

Unnamed: 0,class,age,sex,survived
1,1st class,adults,man,yes
2,1st class,adults,man,yes
3,1st class,adults,man,yes
4,1st class,adults,man,yes
5,1st class,adults,man,yes


In [38]:
pd.crosstab(index=titanic['class'], columns = 'count', colnames = [''])

Unnamed: 0_level_0,count
class,Unnamed: 1_level_1
1st class,325
2nd class,285
3rd class,706


In [39]:
pd.crosstab(index=titanic['class'], columns = 'count')

col_0,count
class,Unnamed: 1_level_1
1st class,325
2nd class,285
3rd class,706


Note the use of the `colnames` argument in the first call. Without it, the column header is labelled `col_0` and the output looks a little awkward. This is because the function `crosstab`, as the name suggests, is designed for cross tabulation involving two or more variables.

The next example, a two-way frequency table, makes this clear.

In [40]:
pd.crosstab(index=titanic['class'], columns = titanic['sex'])

sex,man,women
class,Unnamed: 1_level_1,Unnamed: 2_level_1
1st class,180,145
2nd class,179,106
3rd class,510,196


In [41]:
pd.crosstab(index=titanic['class'], columns = titanic['sex'], colnames = ['Gender'])

Gender,man,women
class,Unnamed: 1_level_1,Unnamed: 2_level_1
1st class,180,145
2nd class,179,106
3rd class,510,196
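
For a one-way frequency table, the `value_counts` method of a Series is a possible alternative to `crosstab`. The sketch below uses a small made-up column (`demo` is hypothetical data, not the `titanic` dataset) and shows that both approaches give the same counts.

```python
import pandas as pd

# Made-up passenger classes, used only for illustration.
demo = pd.DataFrame({"class": ["1st", "3rd", "2nd", "3rd", "1st", "3rd"]})

# One-way table via value_counts, sorted by category label.
one_way = demo["class"].value_counts().sort_index()

# The same counts via crosstab, using a constant column label.
via_crosstab = pd.crosstab(index=demo["class"], columns="count")["count"]

print(one_way)
```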


### $\chi^2$ test of Independence

A contingency table can be prepared by including the margins in the output.

In [42]:
ContTable = pd.crosstab(index=titanic['survived'], columns = titanic['sex'], margins = True)
ContTable

sex,man,women,All
survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,694,123,817
yes,175,324,499
All,869,447,1316


In [43]:
ExpContTable = ContTable.loc[:, ['All']].values * ContTable.loc[['All']].values/ ContTable.loc['All', 'All']
ExpContTable = pd.DataFrame(ExpContTable, index = ContTable.index, columns = ContTable.columns)
ExpContTable

sex,man,women,All
survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,539.493161,277.506839,817.0
yes,329.506839,169.493161,499.0
All,869.0,447.0,1316.0


In [44]:
Chi2Value = (ContTable.iloc[0:2, 0:2]**2/ExpContTable.iloc[0:2, 0:2]).sum().sum() - ContTable.loc['All', 'All']
Chi2Value

343.5683748952206

#### Computing P-Value

In [45]:
from scipy import stats
PValue = stats.chi2.sf(Chi2Value, df=1)
print("P Value = {:.4f}".format(PValue))

P Value = 0.0000
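
The hand computation can be cross-checked with `scipy.stats.chi2_contingency`. The sketch below enters the observed counts from the contingency table directly, so it does not need the `titanic` dataset. Passing `correction=False` turns off Yates' continuity correction, which SciPy applies by default to 2×2 tables and which the formula above does not use.

```python
import numpy as np
from scipy import stats

# Observed counts from the table above (rows: no/yes, columns: man/women).
observed = np.array([[694, 123],
                     [175, 324]])

# correction=False disables Yates' continuity correction so that the
# statistic matches the uncorrected formula used above.
chi2_value, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(chi2_value, dof)
print(expected)
```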


#### Conclusion

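Since the p-value is practically zero, far below conventional significance levels such as 0.05, we reject the null hypothesis that survival is independent of sex.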

### Frequency table for Numeric data

Computing a frequency table of numeric data is one of the most common tasks in data analysis.

This can be accomplished using the pandas `cut` function and the `value_counts` method of a Series, as shown below.

In [46]:
bins = [4.0, 4.4, 4.8, 5.2, 5.6, 6.0, 6.4, 6.8, 7.2, 7.6, 8.0]
freq = pd.cut(iris['Sepal.Length'], bins, right = False).value_counts().sort_index()
freq

[4.0, 4.4)     1
[4.4, 4.8)    10
[4.8, 5.2)    30
[5.2, 5.6)    18
[5.6, 6.0)    24
[6.0, 6.4)    25
[6.4, 6.8)    22
[6.8, 7.2)     9
[7.2, 7.6)     5
[7.6, 8.0)     6
Name: Sepal.Length, dtype: int64

In the above calculation, the bins are specified explicitly. However, the bin edges can also be computed using the `np.histogram_bin_edges` function.

In [47]:
import numpy as np
bins = np.histogram_bin_edges(iris['Sepal.Length'], bins = 'sturges' )
bins

array([4.3, 4.7, 5.1, 5.5, 5.9, 6.3, 6.7, 7.1, 7.5, 7.9])

In [50]:
pd.cut(iris['Sepal.Length'], bins, right = False).value_counts()

[5.5, 5.9)    28
[5.9, 6.3)    28
[4.7, 5.1)    23
[5.1, 5.5)    20
[6.7, 7.1)    17
[6.3, 6.7)    14
[4.3, 4.7)     9
[7.1, 7.5)     5
[7.5, 7.9)     5
Name: Sepal.Length, dtype: int64
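
One point to watch: the counts above add up to 149, although `iris` has 150 rows. The last computed edge equals the largest observation (7.9), and with `right = False` the last interval `[7.5, 7.9)` excludes it. The sketch below reproduces the effect on a few made-up values and shows two possible ways around it; the data `x` and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; the largest one (7.9) coincides with the last bin edge.
x = pd.Series([4.3, 5.0, 5.8, 6.4, 7.9])
edges = np.histogram_bin_edges(x, bins=3)

# Left-closed intervals [a, b): the maximum falls outside the last bin.
left_closed = pd.cut(x, edges, right=False)
print(left_closed.isna().sum())

# Option 1: right-closed intervals that also include the lowest edge.
right_closed = pd.cut(x, edges, include_lowest=True)

# Option 2: let NumPy count; its last bin is closed on both sides.
counts, _ = np.histogram(x, bins=edges)
print(counts.sum())
```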

In [49]:
iris['Sepal.Length'].head()

1    5.1
2    4.9
3    4.7
4    4.6
5    5.0
Name: Sepal.Length, dtype: float64