This notebook contains a summary of the different topics covered in [Kaggle's Pandas Course](https://www.kaggle.com/learn/pandas).

First of all, we import the necessary modules.

In [None]:
import pandas as pd
print("Setup Complete.")

# Creating, Reading and Writing
We can create a DataFrame (a pandas table with labeled rows and columns) as follows.

In [None]:
fruit_sales = pd.DataFrame([[35, 21],[41, 34]], index=["2017 Sales", "2018 Sales"], columns=["Apples", "Bananas"])
print(fruit_sales)

We can also create a Series, a one-dimensional sequence of values with an index.

In [None]:
ingredients = pd.Series(["4 cups", "1 cup", "2 large", "1 can"],
                        index=["Flour", "Milk", "Eggs", "Spam"],
                        name="Dinner")
print(ingredients)

If we have a CSV file, we can read it into a DataFrame. Here we use the Wine Reviews dataset, which has been added to this notebook.

In [None]:
reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
print(reviews.head())

Finally, to export a DataFrame to a CSV file, we use the *to_csv( )* method.
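
As a minimal sketch of the export step, the example below writes a small table to an in-memory text buffer instead of a file (an assumption made so the example runs anywhere) and reads it back to check the round trip. Passing a path such as a hypothetical `"fruit_sales.csv"` to *to_csv( )* would write to disk instead.

```python
import io

import pandas as pd

# Small table standing in for any DataFrame we want to export.
fruit_sales = pd.DataFrame(
    [[35, 21], [41, 34]],
    index=["2017 Sales", "2018 Sales"],
    columns=["Apples", "Bananas"],
)

# to_csv accepts a file path or a writable text buffer; a buffer avoids touching disk.
buffer = io.StringIO()
fruit_sales.to_csv(buffer)

# Read the CSV text back, using the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```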

# Indexing, Selecting & Assigning

We can access a specific column of the data as follows:

In [None]:
desc = reviews["description"]
print(desc)
# or
desc = reviews.description
print(desc)

*iloc* and *loc* are used to access particular rows and columns of a DataFrame, or particular entries of a Series. *iloc* selects by integer position, whereas *loc* selects by index label (labels can be strings or integers).

In [None]:
first_description = reviews.description.iloc[0]
print(first_description)
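
The difference between the two accessors is easiest to see when the index labels do not coincide with the positions. The following sketch uses a hypothetical toy table (not the Wine Reviews data) whose index labels are 10, 20 and 30.

```python
import pandas as pd

# Hypothetical toy data; the integer index labels deliberately differ from positions.
wines = pd.DataFrame(
    {"country": ["Italy", "France", "Spain"], "points": [87, 92, 90]},
    index=[10, 20, 30],
)

by_position = wines.iloc[0, 0]       # first row, first column, whatever their labels are
by_label = wines.loc[20, "country"]  # row labeled 20, column labeled "country"

# iloc slices exclude the end position; loc slices include the end label.
n_iloc = len(wines.iloc[0:2])
n_loc = len(wines.loc[10:30])
print(by_position, by_label, n_iloc, n_loc)
```

Note that `wines.loc[0]` would raise a `KeyError` here, because no row carries the label 0.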

We can also query several rows and columns at once:

In [None]:
indices = [0, 1, 10, 100]
labels = ["country", "province", "region_1", "region_2"]
df = reviews.loc[indices, labels]
print(df.head())

We can also select all the rows that satisfy a certain condition, for example that a column matches a desired value. In this case, we select all the wines from Italy.

In [None]:
italian_wines = reviews[reviews.country == "Italy"]
print(italian_wines)
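
Conditions like this can be combined. The sketch below, on a hypothetical toy table, joins two conditions with `&` and uses *isin( )* to match against a list of values.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews table.
wines = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "Spain"],
    "points": [87, 92, 95, 90],
})

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds more tightly than == and >=.
top_italian = wines[(wines.country == "Italy") & (wines.points >= 90)]

# isin keeps the rows whose value appears in a list of allowed values.
italy_or_spain = wines[wines.country.isin(["Italy", "Spain"])]

print(top_italian)
print(italy_or_spain)
```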

# Summary Functions and Maps

In [None]:
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

Many functions in pandas allow us to compute statistics over the data. For instance, we can compute the mean and the median of a column over all the observations.

In [None]:
median_points = reviews.points.median()
print(median_points)
mean_points = reviews.points.mean()
print(mean_points)

We can list the distinct values of a column, without repetition:

In [None]:
countries = reviews.country.unique()
print(countries)

We can also count how often each value occurs in the dataset. For instance, how many wines do we have for each country?

In [None]:
reviews_per_country = reviews["country"].value_counts()
print(reviews_per_country)

We can also run more complex queries. For instance, to find the wine with the highest points-to-price ratio:

In [None]:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
print(bargain_wine)

We can also count how many wine descriptions contain a certain word. We use the method *map( )*:

In [None]:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(descriptor_counts)
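
As an alternative sketch, the same kind of count can be obtained with the vectorized string method `Series.str.contains`. The descriptions below are hypothetical, and `regex=False` is passed so the word is treated as a plain substring.

```python
import pandas as pd

# Hypothetical descriptions standing in for reviews.description.
descriptions = pd.Series([
    "Ripe tropical fruit with a fruity finish",
    "Dry and earthy",
    "Bright, fruity and fresh",
])

# map applies a Python function to every element.
n_trop_map = descriptions.map(lambda desc: "tropical" in desc).sum()

# str.contains performs the same substring test without an explicit lambda.
n_trop_str = descriptions.str.contains("tropical", regex=False).sum()
n_fruity_str = descriptions.str.contains("fruity", regex=False).sum()

print(n_trop_map, n_trop_str, n_fruity_str)
```

Both approaches count descriptions that contain the word at least once, not the total number of occurrences.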

Finally, we can compute a new value for every row with a custom function by using the *apply( )* method:

In [None]:
def to_star(row):
    # Map a wine's points to a star rating between 1 and 3.
    if row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(to_star, axis="columns")
print(star_ratings)

In this case, we have "mapped" the point scores to stars ranging from 1 to 3, with thresholds at 85 and 95 points.
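
Row-wise *apply( )* can be slow on large tables. As a hedged alternative (not part of the course material summarized here), the same thresholds can be expressed with `pd.cut`, which bins a whole column at once. The point values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical point scores covering both thresholds.
points = pd.Series([80, 85, 94, 95, 100])

def to_star(p):
    # Same rule as in the cell above, applied to a single value.
    if p >= 95:
        return 3
    elif p >= 85:
        return 2
    return 1

stars_apply = points.apply(to_star)

# right=False makes each bin closed on the left: [-inf, 85), [85, 95), [95, inf).
stars_cut = pd.cut(
    points,
    bins=[-np.inf, 85, 95, np.inf],
    right=False,
    labels=[1, 2, 3],
).astype(int)

print(stars_cut.tolist())
```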

# Grouping and Sorting
We can build a summary table by using the *groupby( )* method. For example, we can list the reviewers (by Twitter handle) together with the number of reviews each one submitted:

In [None]:
reviews_written = reviews.groupby('taster_twitter_handle').size()
print(reviews_written)

In addition, we can use one of the features of the data as the index and then sort the result as needed. For example, we can create a Series whose index holds the prices and whose values are the best score received at each price, sorted by ascending price:

In [None]:
best_rating_per_price = reviews.groupby('price').points.max().sort_index()
print(best_rating_per_price)

By means of the *agg( )* method, we can compute several aggregations at once, each in its own column, for instance *min( )* and *max( )*:

In [None]:
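# Passing the aggregation names as strings avoids the warning that recent pandas versions emit for the built-in min and max.
price_extremes = reviews.groupby('variety').price.agg(['min', 'max'])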
print(price_extremes)
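
As a small sketch on hypothetical data, *agg( )* also accepts keyword arguments ("named aggregation"), which lets us choose the output column names. The names `lowest` and `highest` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews table.
wines = pd.DataFrame({
    "variety": ["Merlot", "Merlot", "Riesling", "Riesling"],
    "price": [12.0, 30.0, 9.0, 15.0],
})

# A list of names yields one column per aggregation, named after it.
price_extremes = wines.groupby("variety").price.agg(["min", "max"])

# Keyword arguments rename the output columns: new_name="aggregation".
named = wines.groupby("variety").price.agg(lowest="min", highest="max")

print(price_extremes)
print(named)
```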

The previous DataFrame can also be sorted in descending order by the values of one or more of its columns:

In [None]:
sorted_varieties = price_extremes.sort_values(by=['min', 'max'], ascending=False)
print(sorted_varieties)

We can compute statistics within each subgroup of the DataFrame. We can then use the *describe( )* method to show summary statistics of the result.

In [None]:
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
print(reviewer_mean_ratings)


In [None]:
reviewer_mean_ratings.describe()

Last but not least, grouping by several columns produces a result with a multi-level index (MultiIndex):

In [None]:
country_variety_counts = reviews.groupby(["country", "variety"]).size().sort_values(ascending=False)
print(country_variety_counts)
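
A multi-level index is addressed with tuples of labels, and *reset_index( )* turns it back into ordinary columns. The sketch below shows both on hypothetical toy data. The column name `n_reviews` is an illustrative choice.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews table.
wines = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "Italy"],
    "variety": ["Nebbiolo", "Nebbiolo", "Merlot", "Merlot"],
})

counts = wines.groupby(["country", "variety"]).size()

# A tuple of labels selects one (country, variety) pair.
n_italian_nebbiolo = counts.loc[("Italy", "Nebbiolo")]

# reset_index converts the index levels into regular columns.
flat = counts.reset_index(name="n_reviews")

print(n_italian_nebbiolo)
print(flat)
```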

# Data Types and Missing Values

We can check the data type of a column using the attribute *dtype*:

In [None]:
dtype = reviews.points.dtype
print(dtype)

Additionally, we can convert a column from one type to another:

In [None]:
point_strings = reviews.points.astype("str")
print(point_strings)

The method *isnull( )* marks the missing entries of a column, which lets us check whether, and how much, data is missing.

In [None]:
n_missing_prices = reviews.price.isnull().sum()
print(n_missing_prices)

In case of missing data, we can replace the missing values with a placeholder such as "Unknown", here in the *region_1* column:

In [None]:
regions_filled = reviews.region_1.fillna("Unknown")
print(regions_filled)
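
The following sketch, on hypothetical toy data, counts the missing values per column and fills them. Using the median for the numeric column is an illustrative assumption, not a recommendation from the course. It keeps the column numeric, whereas filling with a string would not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both a text and a numeric column.
wines = pd.DataFrame({
    "region_1": ["Etna", np.nan, "Napa Valley", np.nan],
    "price": [20.0, np.nan, 35.0, 15.0],
})

# isnull on a DataFrame gives a boolean table; summing counts missing values per column.
n_missing = wines.isnull().sum()

# A text placeholder suits a text column.
regions_filled = wines.region_1.fillna("Unknown")

# A numeric fill value keeps the price column numeric.
prices_filled = wines.price.fillna(wines.price.median())

print(n_missing)
print(regions_filled.tolist())
print(prices_filled.tolist())
```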

# Renaming and Combining
Renaming columns in pandas is carried out using the *rename( )* method:

In [None]:
renamed = reviews.rename(columns={"region_1": "region", "region_2": "locale"})
print(renamed)

Moreover, we can give a name to the index axis itself:

In [None]:
reindexed = reviews.rename_axis("wines", axis="rows")
print(reindexed)

We can use *concat( )* to stack several DataFrames that share the same columns:
> combined_products = pd.concat([DataFrame1, DataFrame2])
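
A runnable sketch of the same idea, with two hypothetical tables in place of `DataFrame1` and `DataFrame2`:

```python
import pandas as pd

# Two hypothetical tables with the same columns.
canada = pd.DataFrame({"title": ["A", "B"], "views": [10, 20]})
britain = pd.DataFrame({"title": ["C"], "views": [5]})

# By default the original index labels are kept, so they may repeat.
combined = pd.concat([canada, britain])

# ignore_index=True builds a fresh 0..n-1 index instead.
combined_reset = pd.concat([canada, britain], ignore_index=True)

print(combined)
print(combined_reset)
```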

Last but not least, we can join two DataFrames that share a key column by setting that column as the index of both:
> left = powerlifting_meets.set_index("MeetID")  
> right = powerlifting_competitors.set_index("MeetID")  
> powerlifting_combined = left.join(right)
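
The sketch below reproduces this pattern on hypothetical miniature tables. Because both tables also contain a column called `Name`, the suffix arguments are needed: without them *join( )* raises an error when column names overlap. All table contents and the suffixes `_meet` and `_lifter` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables, sharing the key "MeetID".
meets = pd.DataFrame({
    "MeetID": [0, 1],
    "Name": ["Spring Open", "Fall Classic"],
})
competitors = pd.DataFrame({
    "MeetID": [0, 0, 1],
    "Name": ["Ann", "Bo", "Cy"],
})

left = meets.set_index("MeetID")
right = competitors.set_index("MeetID")

# join matches rows on the index; the suffixes disambiguate the overlapping "Name" column.
combined = left.join(right, lsuffix="_meet", rsuffix="_lifter")
print(combined)
```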

# References
1. [Kaggle's Pandas Course](https://www.kaggle.com/learn/pandas)
2. [Pandas: Python Data Analysis Library](https://pandas.pydata.org)