A blog post from KDnuggets on how to select subsets of data in pandas using [], .loc, .iloc, .at, and .iat, with the wine quality dataset hosted on the UCI website as the example. The dataset records 11 chemical properties (such as the concentrations of sugar and citric acid, the alcohol content, and the pH) of thousands of red and white wines from northern Portugal, along with the quality of each wine, recorded on a scale from 0 to 10. We will only look at the data for red wine.

In [1]:
import pandas as pd

In [2]:
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

In [3]:
wine_df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [5]:
wine_df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

We see that many column names contain spaces. Let's rename them all so that the spaces become underscores.

In [8]:
wine_df.columns = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide','density','pH','sulphates', 'alcohol', 'quality' ]
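Typing out the full list works, but the same renaming can also be done programmatically. The following is a minimal sketch, assuming a small stand-in frame (`demo_df` is a hypothetical name, built from a couple of the rows shown above) rather than the full dataset:

```python
import pandas as pd

# Tiny stand-in frame that reuses a few of the original column names
demo_df = pd.DataFrame({
    'fixed acidity': [7.4, 7.8],
    'volatile acidity': [0.70, 0.88],
    'pH': [3.51, 3.20],
})

# Replace every space in the column labels with an underscore
# (regex=False treats ' ' as a literal string, not a pattern)
demo_df.columns = demo_df.columns.str.replace(' ', '_', regex=False)

print(list(demo_df.columns))
```

This approach avoids typos in a long hand-written list and leaves labels without spaces, such as `pH`, unchanged.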

In [9]:
wine_df['fixed_acidity'].head() # recommended way

0     7.4
1     7.8
2     7.8
3    11.2
4     7.4
Name: fixed_acidity, dtype: float64

In [10]:
wine_df.fixed_acidity.head() # works, but attribute access is discouraged

0     7.4
1     7.8
2     7.8
3    11.2
4     7.4
Name: fixed_acidity, dtype: float64
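Both forms return the same Series here, so it may help to see why bracket access is the safer habit. The sketch below uses a hypothetical two-column frame (`demo_df`, with a made-up `count` column) to show two cases where attribute access does not reach the column:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'fixed acidity': [7.4, 7.8],  # label contains a space
    'count': [1, 2],              # hypothetical column sharing its name with a DataFrame method
})

# Bracket access works for any label
spaced = demo_df['fixed acidity']
counts = demo_df['count']

# Attribute access finds the DataFrame.count method first, not the column
attr = demo_df.count
print(callable(attr))
print(type(counts).__name__)

# A label with a space cannot be written as an attribute at all:
# demo_df.fixed acidity  -> SyntaxError
```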

# Selecting multiple columns

In [15]:
wine_four = wine_df[['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar']]

In [18]:
cols = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar']
wine_list_four = wine_df[cols]  # same selection, with the column list stored in a variable

In [17]:
wine_list_four.head()

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar
0,7.4,0.7,0.0,1.9
1,7.8,0.88,0.0,2.6
2,7.8,0.76,0.04,2.3
3,11.2,0.28,0.56,1.9
4,7.4,0.7,0.0,1.9
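One detail worth spelling out is what the inner list changes: a single string inside `[]` returns a Series, while a list (even a one-element list) returns a DataFrame. A minimal sketch, assuming a small stand-in frame named `demo_df` built from the first three rows above:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'fixed_acidity': [7.4, 7.8, 7.8],
    'volatile_acidity': [0.70, 0.88, 0.76],
    'citric_acid': [0.00, 0.00, 0.04],
})

one_col_series = demo_df['fixed_acidity']    # string label -> Series
one_col_frame = demo_df[['fixed_acidity']]   # one-element list -> DataFrame

# Columns come back in the order given in the list
two_cols = demo_df[['citric_acid', 'fixed_acidity']]

print(type(one_col_series).__name__, type(one_col_frame).__name__)
print(list(two_cols.columns))
```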


# Selecting columns using "select_dtypes" and "filter" methods

In [25]:
wine_df.dtypes.value_counts() 

float64    11
int64       1
dtype: int64

In [27]:
wine_df.select_dtypes(include=['float']).head()  # select only the float columns

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
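`select_dtypes` also accepts an `exclude` argument and the generic `'number'` selector. The sketch below illustrates both on a hypothetical stand-in frame; the `color` column is an assumption added only to have a non-numeric column, since the wine data itself is entirely numeric:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'fixed_acidity': [7.4, 7.8],  # float column
    'quality': [5, 5],            # integer column
    'color': ['red', 'red'],      # hypothetical text column, not in the original data
})

# 'number' matches every numeric dtype (integers and floats)
numeric = demo_df.select_dtypes(include='number')

# exclude drops the matching dtypes and keeps everything else
non_float = demo_df.select_dtypes(exclude='float')

print(list(numeric.columns))
print(list(non_float.columns))
```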


You can also use the **filter** method to select columns based on the column names or index labels.

In [28]:
wine_df.filter(like='acid').head()

,fixed_acidity,volatile_acidity,citric_acid
0,7.4,0.7,0.0
1,7.8,0.88,0.0
2,7.8,0.76,0.04
3,11.2,0.28,0.56
4,7.4,0.7,0.0
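Besides `like`, which does substring matching, `filter` takes an `items` argument for exact labels. A short sketch of the difference, assuming a hypothetical stand-in frame `demo_df`:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'fixed_acidity': [7.4, 7.8],
    'citric_acid': [0.00, 0.00],
    'alcohol': [9.4, 9.8],
})

# like: keep every column whose name contains the substring
by_like = demo_df.filter(like='acid')

# items: keep exact labels; labels that do not exist are skipped without an error
by_items = demo_df.filter(items=['alcohol', 'not_a_column'])

print(list(by_like.columns))
print(list(by_items.columns))
```

Note that `demo_df[['alcohol', 'not_a_column']]` would raise a `KeyError`, so `filter(items=...)` is the more forgiving choice when some labels may be missing.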


You can use regular expressions with the **regex** parameter in the filter method.

In [29]:
# Rename the 'pH' and 'quality' columns so that their names contain a digit
wine_df.rename(columns={'pH': 'pH_5', 'quality': 'quality_6'}, inplace=True)
# Pass a regex to the filter method to select every column whose name contains a digit
wine_df.filter(regex=r'\d').head()

,pH_5,quality_6
0,3.51,5
1,3.2,5
2,3.26,5
3,3.16,6
4,3.51,5
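The pattern is searched for anywhere in the column name, so anchors such as `^` and `$` are needed to match only the start or end. A minimal sketch, assuming a hypothetical stand-in frame `demo_df` with four of the renamed columns:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'fixed_acidity': [7.4, 7.8],
    'citric_acid': [0.00, 0.00],
    'free_sulfur_dioxide': [11.0, 25.0],
    'total_sulfur_dioxide': [34.0, 67.0],
})

# Names that end in 'acid' (so 'fixed_acidity' is left out)
ends_with_acid = demo_df.filter(regex=r'acid$')

# Names that start with 'free_' or 'total_'
free_or_total = demo_df.filter(regex=r'^(free|total)_')

print(list(ends_with_acid.columns))
print(list(free_or_total.columns))
```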
