## 1. How do I read a tabular data file into pandas?

The first thing you will do when working with pandas is load your file into a data-frame. There are a number of top-level functions for reading data into pandas. We will discuss three of them: pandas.read_csv( ), pandas.read_table( ) and pandas.read_clipboard( ).

### I. pandas.read_csv()

It is used to read comma-separated files (typically with the ".csv" extension) into a data-frame. Calling the ".head( )" method on the data-frame shows its top five rows, and ".tail( )" shows its bottom five rows. Here is an excerpt of code that calls read_csv( ) with a URL as the file path.

In [1]:
import pandas as pd

In [2]:
# comma-separated values
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head() 

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [3]:
ufo.tail() 

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,,OVAL,FL,12/31/2000 23:59
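
Both methods also accept the number of rows to show. The sketch below is only an illustration: `toy` is a made-up ten-row frame standing in for the ufo data, so it runs without a network connection.

```python
import pandas as pd

# Hypothetical toy frame with ten rows (not part of the ufo data set).
toy = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_three = toy.head(3)    # rows labelled 0, 1, 2
last_two = toy.tail(2)       # rows labelled 8, 9
default_head = toy.head()    # five rows when no argument is given

print(first_three)
print(last_two)
```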


Pandas assumed that the first row of the file holds the column names and added a range index (starting at 0) as row labels on the left side. But that may not always be what you want. Suppose the first row of your ".csv" file does not contain column names, or you want the row labels to come from values in your data instead of a range index, or you want to read only a few of the columns, or only a few rows for verification (it can take quite some time to read large datasets), or you want to specify the data type of some columns, etc. How do we do that?

Pandas' default parameter values are sensible and help us get started quickly. But, depending on the file we are loading, we can accomplish a lot while reading it if we use the other available parameters in addition to the file path used above. We will discuss some really useful parameters with the help of read_table( ) below.

### II. pandas.read_table()

pandas.read_csv( ) and pandas.read_table( ) are almost the same, with one difference. CSV stands for comma-separated values, so "," is the default value of the parameter sep (short for separator) in read_csv( ), while for read_table( ) the default value is "\t", which stands for tab.

Now, we will read another file using read_table( ). Check the code in In [4], which stands for Input [4], and its output in Out [4], which stands for Output [4]. In In [5], we will change two of the defaults, "usecols" and "nrows". "usecols" loads only the columns you need instead of every column in the data set, while "nrows" limits the number of rows read from it. With "usecols" you provide either a list of column names or a list of column positions (pay attention whenever I mention the type of object a parameter takes), and with "nrows" you pass an integer for the number of rows.

To read more about which objects each parameter accepts, and more about the function or method you are currently using, hold "Shift" and press "Tab" four times in the notebook, or refer to the pandas documentation.

In [4]:
# tab-separated values
orders = pd.read_table("http://bit.ly/chiporders")
orders.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
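
To see that the separator default really is the only difference, here is a minimal sketch on a small in-memory string. The data in `tsv_text` is made up for illustration and is not the chiporders file; `io.StringIO` simply lets pandas read a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a real file.
tsv_text = "order_id\tquantity\titem_name\n1\t1\tIzze\n1\t2\tChicken Bowl\n"

# read_table splits on tabs by default.
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv gives the same result once sep is set to a tab.
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the default sep=",", read_csv finds no commas and returns one wide column.
wrong_sep = pd.read_csv(io.StringIO(tsv_text))

print(from_table.equals(from_csv))
print(wrong_sep.shape)
```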


In [5]:
# tab-separated values, keeping three columns and only the first three rows
orders = pd.read_table("http://bit.ly/chiporders", usecols=["order_id", "quantity", "item_price"], nrows=3)
# usecols=[0, 1, 4] also works: it gives the column positions instead of the column names
orders.head()

Unnamed: 0,order_id,quantity,item_price
0,1,1,$2.39
1,1,1,$3.39
2,1,1,$3.39
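
The two ways of writing "usecols" can be compared directly. This is a small sketch on in-memory text that copies the first few rows shown in Out [4]; the variable names are mine, and `io.StringIO` replaces the URL so the example runs offline.

```python
import io
import pandas as pd

# A few rows in the same layout as the chiporders file (tab-separated).
tsv_text = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tChips and Fresh Tomato Salsa\t\t$2.39\n"
    "1\t1\tIzze\t[Clementine]\t$3.39\n"
    "1\t1\tNantucket Nectar\t[Apple]\t$3.39\n"
    "1\t1\tChips and Tomatillo-Green Chili Salsa\t\t$2.39\n"
)

# Select columns by name and stop after three data rows.
by_name = pd.read_table(
    io.StringIO(tsv_text),
    usecols=["order_id", "quantity", "item_price"],
    nrows=3,
)

# Select the same columns by position (0 is the leftmost column).
by_position = pd.read_table(io.StringIO(tsv_text), usecols=[0, 1, 4], nrows=3)

print(by_name.equals(by_position))
print(by_name.shape)
```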


Now, let's try to read another data set into a data-frame. Out[6] does not look right: the text in place of the column name looks just like the data below it, and it seems the data were not read properly. There is something called "Tidy data", but we will talk about it in some other blog. For now, as you can see in Out[6], the values in each row are separated by pipes "|" rather than tabs "\t", so read_table( ) treated each line as a single column. We need to change the separator from "\t" to "|" by passing sep="|". Also, header=None tells pandas that the first row of the data set does not contain column labels but is part of the data itself. We can then supply the column names with names=user_cols, where user_cols is a list of column names.



In [6]:
users = pd.read_table("http://bit.ly/movieusers")
users.head()

Unnamed: 0,1|24|M|technician|85711
0,2|53|F|other|94043
1,3|23|M|writer|32067
2,4|24|M|technician|43537
3,5|33|F|other|15213
4,6|42|M|executive|98101


In [7]:
user_cols = ["user_id", "age", "gender", "occupation", "zip_code"]
users = pd.read_table("http://bit.ly/movieusers", sep="|", header=None, names=user_cols, index_col="user_id")
users.head()

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [7] shows the modified code, and our data-frame now looks fine. Also, notice that the range index (which started at 0) has been replaced by "user_id". You can pass a string ("user_id") to "index_col" to use that column as row labels, or a list of strings to build a multi-index (let's talk about that in another blog). Technically, you can do all of this with read_csv( ) too, but it just doesn't feel right to read a non-CSV file with read_csv( ). You should also know that, because this file has no header row, "usecols" cannot refer to columns by name unless you supply the names yourself through "names". What always works is a list of integer positions, starting at 0 for the leftmost column and increasing by 1 to the right: pass 'usecols=[list of integers]' to select particular columns. We will go back to our "ufo" data frame once again.
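
Before moving on, here is a minimal sketch of those options on a headerless, pipe-separated string. The three records are copied from the output above; `pipe_text` and the other variable names are mine, and `io.StringIO` stands in for the URL.

```python
import io
import pandas as pd

# Three records in the movieusers layout: no header row, "|" as separator.
pipe_text = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"
user_cols = ["user_id", "age", "gender", "occupation", "zip_code"]

# Without header=None the first record is consumed as column labels.
lost_row = pd.read_table(io.StringIO(pipe_text), sep="|")

# Keep columns 0, 1 and 3 by position and use user_id as the row labels.
subset = pd.read_table(
    io.StringIO(pipe_text),
    sep="|",
    header=None,
    names=user_cols,
    usecols=[0, 1, 3],
    index_col="user_id",
)

# Because names= supplies the labels, usecols can also refer to them by name.
by_name = pd.read_table(
    io.StringIO(pipe_text),
    sep="|",
    header=None,
    names=user_cols,
    usecols=["user_id", "age", "occupation"],
    index_col="user_id",
)

print(lost_row.shape)
print(subset)
```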

In [9]:
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


Now, we will talk a little about ".shape" and ".dtypes". ".shape" returns the number of rows and columns of our data-frame as a tuple, while ".dtypes" shows the type of data in each column (e.g. float, object, int, etc.). We can set the data type while reading the file with the "dtype" parameter. We know that the number of states in the US is far smaller than 18241, the number of rows, so the "State" column holds many repeated values, and we will change its data type to "category" by passing a dictionary to dtype. Why should we do so? Well, it will save space and help with performance too (more in another blog). Sometimes you will find numbers stored as object, and you might want to perform calculations with them. You will first have to change the data type to a numeric format, i.e. float or integer. I have also used a parameter called "parse_dates". It is extremely helpful for converting a column of date-time strings into the date-time format (another blog). What "parse_dates" accepts depends on the case, and I would highly recommend going through its documentation, as I cannot include everything here.



In [10]:
ufo.shape

(18241, 5)

In [11]:
ufo.dtypes

City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object

In [12]:
ufo = pd.read_csv("http://bit.ly/uforeports", dtype={"State":"category"}, parse_dates=["Time"])
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,1930-06-01 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-06-01 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00


If we check data types again, here’s what we get.

In [13]:
ufo.dtypes

City                       object
Colors Reported            object
Shape Reported             object
State                    category
Time               datetime64[ns]
dtype: object
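
Two claims from the dtype discussion can be checked on a small scale: that "category" saves space when values repeat, and that numbers stored as text must be converted before doing arithmetic. The sketch below uses made-up data (3000 rows, three distinct states, a price column written like "$2.39"), so the exact byte counts are illustrative only and will differ for the real ufo file.

```python
import io
import pandas as pd

# Hypothetical data: a header plus 3000 rows with only three distinct states.
rows = ["state,time,price"] + [
    f"{s},6/1/1930 22:00,$2.39" for s in ("NY", "NJ", "CO")
] * 1000
csv_text = "\n".join(rows)

# Default read: every column comes back as text.
as_text = pd.read_csv(io.StringIO(csv_text))

# Same data with a categorical state column and parsed timestamps.
as_category = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"state": "category"},
    parse_dates=["time"],
)

# deep=True counts the memory held by the strings themselves.
text_bytes = as_text["state"].memory_usage(deep=True)
category_bytes = as_category["state"].memory_usage(deep=True)
print(text_bytes, category_bytes)

# Strip the dollar sign, then convert the text to floats for arithmetic.
prices = as_text["price"].str.replace("$", "", regex=False).astype(float)
print(prices.dtype, prices.mean())
```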

### III. pandas.read_clipboard()

I find read_clipboard( ) interesting. You open the file that contains your tabular data, select the portion you want, copy it to the clipboard (Ctrl + C), and then run a few lines of code to read the selection into a data-frame. We have two tables that we want to read into a data-frame.

<img src="https://raw.githubusercontent.com/ujwal-sah/my_tutorials/master/Pandas/rc1.png" width=1000 align="center">

First, I selected the table on the left, copied it, and then executed the code below. Pandas even got the data types correct.

In [14]:
df = pd.read_clipboard()
df.head()

Unnamed: 0,Column A,Column B,Column C,Column D,Column E
0,1,one,100,0.1,a
1,2,two,200,0.2,b
2,3,three,300,0.3,a
3,4,four,400,0.4,b
4,5,five,500,0.5,a


In [15]:
df.dtypes

Column A      int64
Column B     object
Column C      int64
Column D    float64
Column E     object
dtype: object

Then, I selected the table on the right and repeated the steps. Notice that it even identified an index for me.

In [16]:
df = pd.read_clipboard()
df.head()

Unnamed: 0,Left,Right,Center
AAA,12,0.6,a
BBB,32,1.6,b
CCC,42,2.1,c
DDD,13,0.65,d
ABA,54,2.7,a


In [17]:
df.dtypes

Left        int64
Right     float64
Center     object
dtype: object
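
read_clipboard( ) needs a real system clipboard, so it cannot be reproduced everywhere. According to the pandas documentation it passes the clipboard text to read_csv( ) with a whitespace separator by default, so the behaviour can be approximated as below. This is a sketch under that assumption: `copied_text` is a made-up stand-in for what the clipboard would hold, and the real function applies some extra detection (for example for tab-separated spreadsheet copies) that is not shown here.

```python
import io
import pandas as pd

# Stand-in for clipboard content: the header has three names,
# but each data row has four fields.
copied_text = (
    "Left Right Center\n"
    "AAA 12 0.6 a\n"
    "BBB 32 1.6 b\n"
    "CCC 42 2.1 c\n"
)

# Split on runs of whitespace, as read_clipboard does by default.
simulated = pd.read_csv(io.StringIO(copied_text), sep=r"\s+")

# The extra leading field in each row becomes the index.
print(simulated)
print(simulated.dtypes)
```

Because each data row has one more field than the header, the first field is used as the row label, which is how an index can appear without being requested.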

As a parting note, I would say pandas is a very versatile tool with several more functions for reading data into a data-frame. I cannot write about them all, but I have tried to give you enough to make the documentation on the pandas site easier to follow. Some other read functions you should take a look at are ".read_excel( )", ".read_html( )" and ".read_pickle( )". With ".read_excel( )", the parameters "skiprows" and "skipfooter" are often very helpful. Excel files may contain notes about the data, or about the person who collected it, above or below the data set itself. Those lines are not relevant to the data frame, and we need to skip them while reading the data set.
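
read_csv( ) accepts "skiprows" and "skipfooter" as well, so the idea can be tried without an Excel file. The sketch below is only an illustration: `report_text` is made-up content with two note lines above the table and one below it. With read_csv( ), "skipfooter" needs engine="python".

```python
import io
import pandas as pd

# Hypothetical export: two note lines, the table, then a closing note.
report_text = (
    "Survey export\n"
    "Collected by: field team\n"
    "city,count\n"
    "Ithaca,3\n"
    "Holyoke,5\n"
    "End of report\n"
)

# Skip the two leading note lines and the single trailing one.
clean = pd.read_csv(
    io.StringIO(report_text),
    skiprows=2,
    skipfooter=1,
    engine="python",  # the default C parser does not support skipfooter
)

print(clean)
```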

