### Ranking:

Ranking is closely related to sorting, assigning ranks from one through the number of valid data points in an array.
By default rank breaks ties by assigning each group the mean rank:

In [2]:
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj.rank())

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64


In [4]:
#Ranks can also be assigned according to the order they’re observed in the data:
import pandas as pd
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj.rank(method='first'))

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64


In [5]:
#You can rank in descending order, too:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj.rank(ascending=False))

0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64


In [3]:
#Ties can instead be assigned the highest rank in the group (method='max'):
obj = pd.Series([7, -5, 7, 4, 2, 7, 4])

print(obj.rank(ascending=False, method='max')) 

0    3.0
1    7.0
2    3.0
3    5.0
4    6.0
5    3.0
6    5.0
dtype: float64


In [5]:
#Passing method='min' assigns the lowest rank in the group instead:

print(obj.rank(ascending=False, method='min')) 

0    1.0
1    7.0
2    1.0
3    4.0
4    6.0
5    1.0
6    4.0
dtype: float64
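
The tie-breaking options are easier to compare side by side. The sketch below ranks the first example Series once per method and collects the results in one DataFrame. The names `methods` and `ranks` are illustrative only, and `'dense'` is an additional pandas option that the cells above do not cover.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, ranking in ascending order
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

With `'dense'`, tied values share a rank as with `'min'`, but the next distinct value always gets the next integer, so no ranks are skipped.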


In [29]:
#DataFrame can compute ranks over the rows or the columns:

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})

print("Original Data Frame: \n",frame)

print("Rank within each column (axis=0): \n",frame.rank(axis=0))

print("Rank within each row (axis=1): \n",frame.rank(axis=1))

Original Data Frame: 
    a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
Rank within each column (axis=0): 
      a    b    c
0  1.5  3.0  2.0
1  3.5  4.0  3.0
2  1.5  1.0  4.0
3  3.5  2.0  1.0
Rank within each row (axis=1): 
      a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0


### Axis indexes with duplicate values

Up until now, all of the examples we've seen have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, uniqueness is not mandatory.



In [30]:
frame["a"].dtype

dtype('int64')

In [17]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

print(obj)

a    0
a    1
b    2
b    3
c    4
dtype: int32


In [19]:
#The index’s is_unique property can tell you whether its values are unique or not:

print(obj.index.is_unique)

False


In [22]:
#Data selection is one of the main things that behaves differently with duplicates. 
#Indexing a value with multiple entries returns a Series while single entries return a scalar value:

print("Duplicate Index: \n",obj['a'])
print("Unique Index: \n",obj['c'])

Duplicate Index: 
 a    0
a    1
dtype: int32
Unique Index: 
 4


In [2]:
#The same logic extends to indexing rows in a DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), 
                  index=['a', 'a', 'b', 'b'],
                  columns=['a', 'a', 'b'])

print("Data Frame with Duplicate Index :\n",df)

print("Fetch Duplicate index: \n",df.loc['a'])

Data Frame with Duplicate Index :
           a         a         b
a -1.343517 -0.715283 -0.673420
a -0.039239  1.948642 -1.075378
b -0.473371 -0.335999  0.056634
b -0.626049 -0.915633  0.546471
Fetch Duplicate index: 
           a         a         b
a -1.343517 -0.715283 -0.673420
a -0.039239  1.948642 -1.075378
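
As a rough illustration of why duplicate labels matter, the sketch below shows one way to find them, what happens when `reindex` is called on a non-unique index, and one possible way to deduplicate first. Keeping the first occurrence of each label is an arbitrary choice made for this example, and the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# Boolean mask marking every label that occurs more than once
dup_mask = obj.index.duplicated(keep=False)
print(obj[dup_mask])

# reindex needs unique labels, so it raises on this Series
reindex_failed = False
try:
    obj.reindex(['a', 'b', 'c'])
except ValueError as exc:
    reindex_failed = True
    print("reindex failed:", exc)

# Keep only the first occurrence of each label, then reindex safely
deduped = obj[~obj.index.duplicated(keep='first')]
print(deduped.reindex(['c', 'b', 'a']))
```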


### Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla NumPy arrays, they are all built from the ground up to exclude missing data.

In [4]:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])

print("Original Data Frame is: \n",df)

print("Sum is: \n",df.sum())

print("Sum over the rows : \n",df.sum(axis = 1))

Original Data Frame is: 
     one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
Sum is: 
 one    9.25
two   -5.80
dtype: float64
Sum over the rows : 
 a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64


In [23]:
#NA values are excluded by default. An all-NA slice (row c above) therefore sums to 0,
#while its mean is NaN. Exclusion can be disabled using the skipna option:
print("Original Data Frame is: \n",df)
print("Sum with NA: \n",df.sum(axis=1, skipna=False))
print("Mean with NA: \n",df.mean(axis=1, skipna=False))

Original Data Frame is: 
     one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
Sum with NA: 
 a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
Mean with NA: 
 a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
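
The all-NA row deserves a closer look. The sketch below, which assumes a pandas version where `sum` accepts `min_count` (0.22 or later), shows how to get NaN instead of 0 for a slice with no valid values while still skipping NAs elsewhere. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default: NAs are skipped, so the all-NA row 'c' sums to 0.0
row_sum = df.sum(axis=1)

# min_count=1 requires at least one valid value, otherwise the result is NaN
row_sum_strict = df.sum(axis=1, min_count=1)

# mean has no valid values to average in row 'c', so it is NaN there
row_mean = df.mean(axis=1)

print(pd.DataFrame({"sum": row_sum, "sum_min_count_1": row_sum_strict, "mean": row_mean}))
```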


In [27]:
#producing multiple summary statistics in one shot:

print("Summary Statistics :\n",df.describe())

Summary Statistics :
             one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


### Handling Missing Data:

Missing data is common in most data analysis applications. One of the goals in designing pandas was to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data.

pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays.

In [1]:
#None, a Python singleton object that is often used for missing data in Python code. 
#Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, 
#but only in arrays with data type 'object' (i.e., arrays of Python objects):

import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [5]:
#NaN: Missing numerical data

#The other missing data representation, NaN (acronym for Not a Number), 
#is different; it is a special floating-point value recognized by all systems 
#that use the standard IEEE floating-point representation:

vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. You should be aware that NaN is a bit like a data virus–it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:

In [6]:
1 + np.nan

nan

In [7]:
2 * np.nan

nan

In [8]:
#Note that this means that aggregates over the values are well defined 
#(i.e., they don't result in an error) but not always useful:

vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [9]:
#NumPy does provide some special aggregations that will ignore these missing values:

np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)
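
One more consequence of the IEEE rules is that NaN never compares equal to anything, including itself, so missing entries have to be located with `np.isnan` rather than `==`. A minimal sketch, with illustrative variable names:

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN is not equal to itself, so an equality test cannot find it
nan_equal = (np.nan == np.nan)
print(nan_equal)

# np.isnan builds a boolean mask of the missing entries
mask = np.isnan(vals)
print(mask)

# Filtering with the mask gives the same mean as the NaN-aware aggregate
clean_mean = vals[~mask].mean()
print(clean_mean, np.nanmean(vals))
```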

# NaN and None in Pandas:

NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [3]:
import numpy as np
import pandas as pd 
print(pd.Series([1,2]))
print(pd.Series([1,2,None]))
#pd.Series([1, np.nan, 2, None])
print(pd.Series([True, True]))
pd.Series([True, np.nan, True, None])
#pd.Series([True, True])
#pd.Series([1, 2])

0    1
1    2
dtype: int64
0    1.0
1    2.0
2    NaN
dtype: float64
0    True
1    True
dtype: bool


0    True
1     NaN
2    True
3    None
dtype: object

Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA:

In [11]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int32

In [12]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value.
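
If the cast to float is unwanted, one alternative is pandas' nullable integer dtype, which is not covered in the text above. The sketch below assumes pandas 0.24 or later, where the `"Int64"` extension dtype (capital I) is available. It stores missing entries as `pd.NA` and keeps the remaining values as integers.

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer extension dtype
x = pd.Series(range(2), dtype="Int64")
x.iloc[0] = None

print(x)          # the missing entry is shown as <NA>
print(x.dtype)    # still an integer dtype, no upcast to float64
print(x.sum())    # NA is skipped by default
```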

| Typeclass | Conversion when storing NAs | NA sentinel value |
|-----------|-----------------------------|-------------------|
| floating  | No change                   | np.nan            |
| object    | No change                   | None or np.nan    |
| integer   | Cast to float64             | np.nan            |
| boolean   | Cast to object              | None or np.nan    |


Keep in mind that in Pandas, string data is always stored with an object dtype.

# Operating on Null Values

As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

- isnull(): Generate a boolean mask indicating missing values
- notnull(): Opposite of isnull()
- dropna(): Return a filtered version of the data
- fillna(): Return a copy of the data with missing values filled or imputed

In [31]:
data1 = pd.Series([1,2,3,np.nan])
data1

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [13]:
##Detecting null values

string_data = pd.Series(['ABC', 'XYZ', np.nan, '123'])

print("Original Series: \n",string_data)

print("Null Check: \n",string_data.isnull())

print("Another method - Null Check: \n",string_data.notnull())



Original Series: 
 0    ABC
1    XYZ
2    NaN
3    123
dtype: object
Null Check: 
 0    False
1    False
2     True
3    False
dtype: bool
Another method - Null Check: 
 0     True
1     True
2    False
3     True
dtype: bool


In [14]:
#Detecting null values
#The built-in Python None value is also treated as NA in object arrays:

string_data[0] = None

print("Updated Series: \n",string_data)

print("Null Check: \n",string_data.isnull())

print("Another method - Null Check: \n",string_data.notnull())


Updated Series: 
 0    None
1     XYZ
2     NaN
3     123
dtype: object
Null Check: 
 0     True
1    False
2     True
3    False
dtype: bool
Another method - Null Check: 
 0    False
1     True
2    False
3     True
dtype: bool
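
Because the mask returned by `isnull()` is boolean, summing it counts the missing values. The sketch below applies this to a small DataFrame to get counts per column and per row. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# True counts as 1, so summing the mask counts the missing values
missing_per_col = data.isnull().sum()
missing_per_row = data.isnull().sum(axis=1)

print("Missing per column:\n", missing_per_col)
print("Missing per row:\n", missing_per_row)
```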


In [55]:
#dropna:  Filter axis labels based on whether values for each label have missing data, with varying thresholds 
#         for how much missing data to tolerate.

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

print("Original Series: \n",data)

#data.dropna()

print("Drop NA : \n",data.dropna())

#Another method

print("Drop NA :\n",data[data.notnull()])

Original Series: 
 0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
Drop NA : 
 0    1.0
2    3.5
4    7.0
dtype: float64
Drop NA :
 0    1.0
2    3.5
4    7.0
dtype: float64


In [16]:
#You may want to drop rows or columns which are all NA or just those containing any NAs. dropna by default drops
#any row containing a missing value:

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                  [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

print("Original Dataframe :\n",data)


print("Drop NA rows :\n",data.dropna())

#Passing how='all' will only drop rows that are all NA:
print("Drop only rows which has all NA values :\n",
      data.dropna(how = 'all'))



Original Dataframe :
      0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
Drop NA rows :
      0    1    2
0  1.0  6.5  3.0
Drop only rows which has all NA values :
      0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


In [18]:
#Dropping columns in the same way is only a matter of passing axis=1:

data[4] = np.nan
print("Original Data Frame: \n",data)

print("Drop only columns which have all NA values :\n",
      data.dropna(axis = 1,#(columns)
                  how = 'all')) #default is how = 'any'

Original Data Frame: 
      0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
Drop only columns which have all NA values :
      0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [7]:
#For finer-grained control, the thresh parameter lets you specify a minimum number of 
#non-null values for the row/column to be kept:

data = pd.DataFrame([[1., 6.5, 3.,np.nan], 
                     [1., np.nan, np.nan,2],
                  [np.nan, np.nan, np.nan,np.nan], 
                     [np.nan, 6.5, 3.,4]])

print("Original Dataframe :\n",data)


print("After thresh: \n",data.dropna(axis='rows', thresh=2))

Original Dataframe :
      0    1    2    3
0  1.0  6.5  3.0  NaN
1  1.0  NaN  NaN  2.0
2  NaN  NaN  NaN  NaN
3  NaN  6.5  3.0  4.0
After thresh: 
      0    1    2    3
0  1.0  6.5  3.0  NaN
1  1.0  NaN  NaN  2.0
3  NaN  6.5  3.0  4.0
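
Another form of finer-grained control, not shown above, is the `subset` parameter of `dropna`, which restricts the NA check to chosen columns. The sketch below reuses the same data; checking columns 0 and 3 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3., np.nan],
                     [1., np.nan, np.nan, 2],
                     [np.nan, np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3., 4]])

# Drop a row if column 0 OR column 3 is missing (how='any' is the default)
both_present = data.dropna(subset=[0, 3])

# Drop a row only if column 0 AND column 3 are both missing
any_present = data.dropna(subset=[0, 3], how='all')

print("Both present:\n", both_present)
print("At least one present:\n", any_present)
```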


In [5]:
#2. Filling in Missing Data

#Rather than filtering out missing data (and potentially discarding other data along with it), 
#you may want to fill in the “holes” in any number of ways.
df = pd.DataFrame([[1., 6.5, 3.,np.nan], 
                   [1., np.nan, np.nan,2],
                  [np.nan, np.nan, np.nan,np.nan], 
                   [np.nan, 6.5, 3.,4]])
print("Original Data frame: \n",df)
print("Replace NA values with 0: \n",df.fillna(0))

#The same interpolation methods available for reindexing can be used with fillna:

Original Data frame: 
      0    1    2    3
0  1.0  6.5  3.0  NaN
1  1.0  NaN  NaN  2.0
2  NaN  NaN  NaN  NaN
3  NaN  6.5  3.0  4.0
Replace NA values with 0: 
      0    1    2    3
0  1.0  6.5  3.0  0.0
1  1.0  0.0  0.0  2.0
2  0.0  0.0  0.0  0.0
3  0.0  6.5  3.0  4.0


In [77]:
#With fillna you can do lots of other things with a little creativity. For example, you
#might pass the mean or median value of a Series:

data = pd.Series([1., np.nan, 3.5, np.nan, 7])

print("Original Series :\n",data)
print("Mean: ",data.mean())

print("Fill with NA :\n",data.fillna(data.mean()))

Original Series :
 0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
Mean:  3.8333333333333335
Fill with NA :
 0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64


In [21]:
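# forward-fill: propagate the last valid value down each column of df
# (df.ffill() is equivalent to df.fillna(method='ffill'), which newer pandas versions deprecate)
df.ffill()

     0    1    2    3
0  1.0  6.5  3.0  NaN
1  1.0  6.5  3.0  2.0
2  1.0  6.5  3.0  2.0
3  1.0  6.5  3.0  4.0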


In [22]:
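# back-fill: propagate the next valid value up each column of df
# (df.bfill() is equivalent to df.fillna(method='bfill'), which newer pandas versions deprecate)
df.bfill()

     0    1    2    3
0  1.0  6.5  3.0  2.0
1  1.0  6.5  3.0  2.0
2  NaN  6.5  3.0  4.0
3  NaN  6.5  3.0  4.0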
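
Two further variations may be useful: passing a dict to `fillna` to use a different fill value per column, and capping how far a forward fill propagates with `limit`. The sketch below uses the same 4x4 frame. The fill values 0.5 and -1 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1., 6.5, 3., np.nan],
                   [1., np.nan, np.nan, 2],
                   [np.nan, np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3., 4]])

# A dict maps column label -> fill value; unlisted columns are left untouched
filled = df.fillna({1: 0.5, 3: -1})
print("Per-column fill values:\n", filled)

# limit=1 fills at most one consecutive NaN after each valid value
limited = df.ffill(limit=1)
print("Forward fill, limit=1:\n", limited)
```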


### Data Loading, Storage, and File Formats

NumPy features low-level but extremely fast binary data loading and storage, including support for memory-mapped arrays.

Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.

pandas features a number of functions for reading tabular data as a DataFrame object.

- read_csv: Load delimited data from a file, URL, or file-like object. Uses a comma as the default delimiter.
- read_table: Load delimited data from a file, URL, or file-like object. Uses tab ('\t') as the default delimiter.
- read_fwf: Read data in fixed-width column format (that is, no delimiters).

A few points to remember:

##### Indexing:
    These functions can treat one or more columns as the index of the returned DataFrame, and can take column names from the file, from the user, or not at all.
##### Type inference and data conversion:
    This includes user-defined value conversions and custom lists of missing-value markers.
##### Datetime parsing:
    This includes the ability to combine date and time information spread over multiple columns into
    a single column in the result.
##### Iterating:
    Support for iterating over chunks of very large files.
##### Unclean data issues:
    Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.


Type inference is one of the more important features of these functions; that means you don’t have to specify which columns are numeric, integer, boolean, or string. Handling dates and other custom types requires a bit more effort, though.
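
To see type inference at work without needing a file on disk, the sketch below reads a small CSV from an in-memory string. The column names and values are invented for illustration. The `qty` column contains an NA marker, so it is inferred as float rather than integer, and the date column is only parsed as a datetime because `parse_dates` asks for it.

```python
import io
import pandas as pd

# Hypothetical sample data held in memory instead of a file
csv_text = (
    "date,qty,price\n"
    "2024-01-05,3,1.50\n"
    "2024-01-06,NA,2.25\n"
)

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df)
print(df.dtypes)
```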

In [16]:
#Changing the working directory

import os
import sys

print(os.getcwd())

os.chdir('C:\\Analytics\\Personal\\Machine Learning\\Training\\R\\Dataset')

print(os.getcwd())

C:\Analytics\Personal\Machine Learning\Training\R\Dataset
C:\Analytics\Personal\Machine Learning\Training\R\Dataset


In [11]:
##Read the file from Current directory

df = pd.read_csv('mushroom1.csv')

print(df)

#We could also have used read_table and specified the delimiter (sep=',')

     Type of mushroom cap-shape cap-surface cap-color bruises     odor  \
0              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES  ALMOND   
1              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES  ALMOND   
2              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES  ALMOND   
3              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES  ALMOND   
4              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES  ALMOND   
5              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES  ALMOND   
6              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES   ANISE   
7              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES   ANISE   
8              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES   ANISE   
9              EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES   ANISE   
10             EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES   ANISE   
11             EDIBLE    CONVEX      SMOOTH     WHITE  BRUISES   ANISE   
12             EDIBLE    CONVEX      S

In [12]:
#A file will not always have a header row

df2 = pd.read_csv('mushroom2.csv',header = None)
print(df2)

         0        1       2       3
0   EDIBLE   CONVEX  SMOOTH     NaN
1   EDIBLE   CONVEX  SMOOTH     NaN
2   EDIBLE   CONVEX  SMOOTH   WHITE
3   EDIBLE   CONVEX  SMOOTH   WHITE
4   EDIBLE   CONVEX  SMOOTH   WHITE
5   EDIBLE      NaN  SMOOTH   WHITE
6   EDIBLE   CONVEX  SMOOTH   WHITE
7   EDIBLE   CONVEX  SMOOTH   WHITE
8   EDIBLE   CONVEX  SMOOTH   WHITE
9   EDIBLE  CONVEX2  SMOOTH   WHITE
10  EDIBLE  CONVEX1  SMOOTH  WHITE1


In [13]:
#When a file has no header row, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:
df2 = pd.read_csv('mushroom2.csv',
                  names = ['col1','col2','col3','col4'] )
print(df2)

      col1     col2    col3    col4
0   EDIBLE   CONVEX  SMOOTH     NaN
1   EDIBLE   CONVEX  SMOOTH     NaN
2   EDIBLE   CONVEX  SMOOTH   WHITE
3   EDIBLE   CONVEX  SMOOTH   WHITE
4   EDIBLE   CONVEX  SMOOTH   WHITE
5   EDIBLE      NaN  SMOOTH   WHITE
6   EDIBLE   CONVEX  SMOOTH   WHITE
7   EDIBLE   CONVEX  SMOOTH   WHITE
8   EDIBLE   CONVEX  SMOOTH   WHITE
9   EDIBLE  CONVEX2  SMOOTH   WHITE
10  EDIBLE  CONVEX1  SMOOTH  WHITE1


In [14]:
#Suppose you want the col1 to be indexed:
df2 = pd.read_csv('mushroom2.csv',
                  names = ['col1','col2','col3','col4'],
                  index_col = 'col1')
print(df2)

           col2    col3    col4
col1                           
EDIBLE   CONVEX  SMOOTH     NaN
EDIBLE   CONVEX  SMOOTH     NaN
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE      NaN  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE  CONVEX2  SMOOTH   WHITE
EDIBLE  CONVEX1  SMOOTH  WHITE1


In [15]:
# Handling Missing Values


df3 = pd.read_csv('mushroom2.csv',
                  names = ['col1','col2','col3','col4'],
                  index_col = 'col1')
print(df3)

           col2    col3    col4
col1                           
EDIBLE   CONVEX  SMOOTH     NaN
EDIBLE   CONVEX  SMOOTH     NaN
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE      NaN  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE   CONVEX  SMOOTH   WHITE
EDIBLE  CONVEX2  SMOOTH   WHITE
EDIBLE  CONVEX1  SMOOTH  WHITE1


In [12]:
#Different NA sentinels can be specified for each column in a dict:

sentinels = {'col2': ['CONVEX1','CONVEX2'], 'col4': ['WHITE1']}
df4 = pd.read_csv('mushroom2.csv',
                  names = ['col1','col2','col3','col4'],
                  index_col = 'col1',
                 na_values=sentinels)
print(df4)


          col2    col3   col4
col1                         
EDIBLE  CONVEX  SMOOTH    NaN
EDIBLE  CONVEX  SMOOTH    NaN
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE     NaN  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE     NaN  SMOOTH  WHITE
EDIBLE     NaN  SMOOTH    NaN


In [13]:
#If you want to only read out a small number of rows (avoiding reading the entire file),
#specify that with nrows:

df5 = pd.read_csv('mushroom2.csv',
                  names = ['col1','col2','col3','col4'],
                  index_col = 'col1',
                 na_values=sentinels,
                  nrows = 5)
print(df5)

          col2    col3   col4
col1                         
EDIBLE  CONVEX  SMOOTH    NaN
EDIBLE  CONVEX  SMOOTH    NaN
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE


In [5]:
#To read out a file in pieces, specify a chunksize as a number of rows:

chunker = pd.read_csv('mushroom2.csv',
                  names = ['col1','col2','col3','col4'],
                  index_col = 'col1',
                 #na_values=sentinels,
                      chunksize=2)
print(chunker)
print(type(chunker))

<pandas.io.parsers.TextFileReader object at 0x0000027AF6BF4940>
<class 'pandas.io.parsers.TextFileReader'>


In [22]:
#The TextFileReader object returned by read_csv allows you to iterate over the parts of the
#file according to the chunksize


for piece in chunker:
    print(piece)

          col2    col3  col4
col1                        
EDIBLE  CONVEX  SMOOTH   NaN
EDIBLE  CONVEX  SMOOTH   NaN
          col2    col3   col4
col1                         
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
          col2    col3   col4
col1                         
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE     NaN  SMOOTH  WHITE
          col2    col3   col4
col1                         
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE  CONVEX  SMOOTH  WHITE
          col2    col3   col4
col1                         
EDIBLE  CONVEX  SMOOTH  WHITE
EDIBLE     NaN  SMOOTH  WHITE
        col2    col3  col4
col1                      
EDIBLE   NaN  SMOOTH   NaN


In [14]:
#Writing Data Out to Text Format
df5.to_csv('Pythonoutput.csv')
df5.to_csv('Pythonoutput.txt',sep='|')

In [24]:
#Other delimiters can be used (writing to sys.stdout so it just prints the text result):
import sys 

df5.to_csv(sys.stdout, sep='|')

col1|col2|col3|col4
EDIBLE|CONVEX|SMOOTH|
EDIBLE|CONVEX|SMOOTH|
EDIBLE|CONVEX|SMOOTH|WHITE
EDIBLE|CONVEX|SMOOTH|WHITE
EDIBLE|CONVEX|SMOOTH|WHITE


In [17]:
#Missing values appear as empty strings in the output. You might want to denote them
#by some other sentinel value:

df5.to_csv(sys.stdout, sep='|',na_rep='NULL')

col1|col2|col3|col4
EDIBLE|CONVEX|SMOOTH|NULL
EDIBLE|CONVEX|SMOOTH|NULL
EDIBLE|CONVEX|SMOOTH|WHITE
EDIBLE|CONVEX|SMOOTH|WHITE
EDIBLE|CONVEX|SMOOTH|WHITE


In [18]:
#column labels can be disabled

df5.to_csv(sys.stdout, sep='|',na_rep='NULL', header = False)

EDIBLE|CONVEX|SMOOTH|NULL
EDIBLE|CONVEX|SMOOTH|NULL
EDIBLE|CONVEX|SMOOTH|WHITE
EDIBLE|CONVEX|SMOOTH|WHITE
EDIBLE|CONVEX|SMOOTH|WHITE


In [20]:
#You can also write only a subset of the columns, and in an order of your choosing.
#col1 is the index of df5, so it is written automatically and is not listed in columns:
df5.to_csv(sys.stdout, sep='|',na_rep='NULL',columns=['col3','col2'])

col1|col3|col2
EDIBLE|SMOOTH|CONVEX
EDIBLE|SMOOTH|CONVEX
EDIBLE|SMOOTH|CONVEX
EDIBLE|SMOOTH|CONVEX
EDIBLE|SMOOTH|CONVEX
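
To check that a written file can be read back as intended, a round trip through an in-memory buffer is a convenient test. The sketch below writes a small, made-up frame with `sep='|'` and `na_rep='NULL'`, then reads it back with matching options. Passing `na_values=['NULL']` is spelled out for clarity even though pandas already treats 'NULL' as missing by default.

```python
import io
import pandas as pd

# Hypothetical two-row frame with one missing value
df = pd.DataFrame({'col2': ['CONVEX', None], 'col3': ['SMOOTH', 'SMOOTH']},
                  index=pd.Index(['EDIBLE', 'EDIBLE'], name='col1'))

# Write to a string buffer instead of a file
buf = io.StringIO()
df.to_csv(buf, sep='|', na_rep='NULL')
text = buf.getvalue()
print(text)

# Read it back with the same delimiter and NA marker
restored = pd.read_csv(io.StringIO(text), sep='|',
                       index_col='col1', na_values=['NULL'])
print(restored)
```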
