# Next Gen Gaming Laptops
Analyzing the dataset to get insights and answer questions like why certain laptops are so popular:

In [None]:
# Loading all the required libraries:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Reading the dataset and viewing its top ten rows:

In [None]:
df = pd.read_csv('../input/new-egg-gaming-laptops/new_egg_gaming_laptops.csv', encoding='ISO-8859-2')
df.head(10)

Here we have data for the new-generation gaming laptops on the market. The dataset contains 13 columns, described below starting from the left-most column:
1. Full name of the laptop
2. Price of laptop in dollars
3. Name of the brand that the laptop belongs to
4. Rating of the item out of 5 stars
5. Number of customers who rated this item
6. Shipping description
7. Percentage of saving from original price
8. Whether the word "touch" appears in the laptop name (1) or not (0)
9. Display in inches
10. The name of microprocessor
11. Graphics Processing Unit (graphics card) present in the laptop
12. Hard Disk Drive (mechanical, spinning disk storage) 
13. Solid State Drive (electronic, integrated circuits storage)

Now, I'm going to perform a quick exploratory data analysis using profiling. This will give an overview of the dataset and the distribution of each variable. First, I'll import the pandas_profiling library (newer releases of this package are published as ydata-profiling):

In [None]:
import pandas_profiling

Now I'm initiating profiling on our dataset df:

In [None]:
pandas_profiling.ProfileReport(df)

From profiling, I've got the following information about this dataset:
1. Number of columns (variables) = 13 (as discussed above)
2. Number of rows = 254
3. Total missing cells (values) in the dataset = 408 (i.e., 12.4% of all cells are missing)
4. Number of categorical variables in the dataset = 10
5. Number of numerical variables = 2
6. Number of boolean variables = 1 (the 'Touch' column has boolean values)
7. First column has no missing values (NaNs)
8. 2nd column has 1 missing value. Also, the average price of the laptops is US$1,550.58
9. 3rd column has no NaNs. A total of 11 brands are mentioned in the dataset
10. Across all laptops:
* 86 are rated 5 stars
* 91 are rated 4 stars
* 48 are rated 3 stars
* 11 are rated 2 stars
* 18 are rated 1 star
11. 162 laptops were shipped (sold) to customers with free shipping
* There's a 'Special Shipping' category covering 56 laptops
* All other laptops have various shipping charges
12. There are 35 distinct discount values in total
* The most common discount is 23%, which applies to 24 laptops
* 8 laptops were sold at a 25% discount
* There are 137 NaNs in the save_prec column
13. The laptops come in 8 different display sizes
* There are 18 NaNs in the display column
* 145 models have a 15.6-inch display
* 66 models have a 17.3-inch display
14. There are 9 types of microprocessors in total
* 5 missing values in the processor column
* 12 models have an Intel i9 processor
* 175 models have an Intel i7 processor
* 38 models have an Intel i5 processor
* 11 models have an AMD R5 processor
* 4 models have an AMD R0 processor
15. There are 17 types of graphics cards
* 62 models have a GTX 1050 GPU
* 45 models have a GTX 1060 GPU
* 24 models have a GTX 1070 GPU
* 22 models have an RTX 2070 GPU
* 20 models have an RTX 2060 GPU
* 21 NaNs in the gpu column, which may also mean that some models don't have a dedicated GPU
16. There are 7 types of hard disk drives
* The distribution shows, however, that the 1TB HDD is recorded under several different labels
* Total models with a 1TB HDD = 77 + 20 + 1 = 98
* The other models have a 2TB HDD or a 500GB HDD
* 150 NaNs in the hdd column
17. There are 25 types of solid state drives
* These include PCIe, NVMe and standard types
* The largest SSD capacity visible in the profile is 1TB
* 76 NaNs in the ssd column
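
As a quick sanity check on the profile figures above, the per-column NaN counts can be added up by hand. The sketch below hard-codes the numbers quoted in this list (nothing is read from the CSV), and the names `nan_counts` and `rating_counts` are just illustrative choices:

```python
# NaN counts per column, copied from the profiling summary above.
nan_counts = {
    "price (2nd column)": 1,
    "save_prec": 137,
    "display": 18,
    "processor": 5,
    "gpu": 21,
    "hdd": 150,
    "ssd": 76,
}

n_rows, n_cols = 254, 13          # dataset shape reported by the profile
total_cells = n_rows * n_cols     # 254 * 13 cells in total
total_missing = sum(nan_counts.values())
missing_pct = 100 * total_missing / total_cells

print(f"{total_missing} missing cells out of {total_cells} ({missing_pct:.1f}%)")

# The star ratings should also account for every row.
rating_counts = {5: 86, 4: 91, 3: 48, 2: 11, 1: 18}
print(sum(rating_counts.values()) == n_rows)
```

The column totals add up to the 408 missing cells reported by the profile, and the five rating buckets cover all 254 rows.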

# Data Cleaning:
Here I'll first check which categorical (object-typed) variables have missing values:

In [None]:
categorical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes=='O']
print(categorical_nan)

Alright, so the 'save_prec', 'display', 'processor', 'gpu', 'hdd' & 'ssd' columns have missing values. (The single missing price seen in the profile is numeric, so this filter doesn't pick it up.) Now I'm filling these missing values with "Empty":

In [None]:
for feature in categorical_nan:
    df[feature] = df[feature].fillna('Empty')

But before moving forward, I need to verify if the missing values got filled or not:

In [None]:
df[categorical_nan].isna().sum()
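
One thing worth noting is that the `dtypes=='O'` filter only catches text columns, so a numeric NaN (like the single missing price) stays in place. Here's a minimal sketch on a tiny made-up frame that repeats the same fill step and then lists whatever is still missing. The rows and the `price` column name are my assumptions, not taken from the real CSV:

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real dataset.
toy = pd.DataFrame({
    "price": [999.0, None, 1499.0],          # numeric column with one NaN
    "gpu": ["GTX 1050", None, "RTX 2060"],   # text column with one NaN
    "hdd": [None, "1TB HDD", None],          # text column with two NaNs
})

# Text (non-numeric) columns that contain at least one missing value.
text_nan_cols = [
    col for col in toy.columns
    if toy[col].isna().any() and not pd.api.types.is_numeric_dtype(toy[col])
]

# Same idea as the loop above: fill text NaNs with a placeholder label.
toy[text_nan_cols] = toy[text_nan_cols].fillna("Empty")

# Whatever is left over is numeric and needs its own decision.
remaining = toy.isna().sum()
print(text_nan_cols)
print(remaining[remaining > 0])
```

How to treat the leftover numeric NaN (drop the row, fill with a median, or leave it) is a separate choice that this notebook doesn't make.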

Alright, now I'm good to explore the dataset. First I'll do it using WordCloud:

# Data Visualization:
This dataset looks very fascinating. Let me explore it thoroughly:

So first I'll see which words are **most common in laptop_name**:

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.laptop_name)
# Creating and generating a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='GnBu', background_color="black").generate(text)
# Displaying the generated image on a single figure:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Haha, okay, this looks messed up. I'd better look at the **most common** **brand_name**:

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.brand_name)
# Creating and generating a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='GnBu', background_color="black").generate(text)
# Displaying the generated image on a single figure:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Yep, now this looks better. So MSI, ASUS & Acer America look like the most common brands. Then Lenovo, Dell & Gigabyte. Then others.
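
Since the text for the cloud is built by joining every value with spaces, the cloud works on words rather than whole brand names, so a two-word brand such as Acer America is fed in as two separate words. As a numeric cross-check, here's a small sketch that counts whole brand values next to individual tokens. The `brands` list is made-up sample data, not the real column:

```python
from collections import Counter

# Hypothetical sample of brand_name values (not the real data).
brands = ["MSI", "ASUS", "Acer America", "MSI", "Acer America", "Dell", "MSI"]

# Counting whole values keeps multi-word brands intact.
brand_counts = Counter(brands)

# Counting tokens mimics the text built for the word cloud: " ".join(...) then split.
token_counts = Counter(" ".join(brands).split())

print(brand_counts.most_common(3))
print(token_counts.most_common(3))
```

On the real data, the whole-value count is what `df['brand_name'].value_counts()` returns.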

How about most common microprocessor? Let me see that:

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.processor)
# Creating and generating a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='GnBu', background_color="black").generate(text)
# Displaying the generated image on a single figure:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

So most models have an **Intel i7**, and the second most common processor is the **Intel i5**, followed by the i9 & R5.

Time to see the most common graphics card (my favorite category):

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.gpu)
# Creating and generating a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='GnBu', background_color="black").generate(text)
# Displaying the generated image on a single figure:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

So it's Nvidia all over the dataset! The **Nvidia GTX** series is the most common in the dataset, followed by the **Nvidia RTX** series. GTX is commonly said to stand for Giga Texel Shader eXtreme. RTX is a comparatively newer line built around **real-time ray tracing**. More on that later :P

Let's also see the most common hard disk drive:

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.hdd)
# Creating and generating a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='GnBu', background_color="black").generate(text)
# Displaying the generated image on a single figure:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Seems like the 1TB HDD is the most common one, followed by the 500GB HDD, and the least common is the 2TB HDD. Well, it seems very few 2TB models are getting sold. But why is that? I'd expect the 2TB HDD to sell more, as it's not double the price of a 1TB HDD, as some people may think. :P
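
Since the profile showed the 1TB HDD recorded under several labels (77 + 20 + 1 entries), it may help to collapse those labels into one before counting. The sketch below is only an illustration: the helper `normalize_capacity` and the sample labels are hypothetical, and the real spellings in the `hdd` column may differ:

```python
import re

# Matches a whole-number capacity followed by a TB/GB unit, e.g. "1 TB" or "500gb".
CAPACITY_RE = re.compile(r"(\d+)\s*(TB|GB)", re.IGNORECASE)

def normalize_capacity(label):
    """Return a canonical capacity like '1TB', or None if no capacity is found."""
    match = CAPACITY_RE.search(str(label))
    if match is None:
        return None
    size, unit = match.groups()
    return f"{int(size)}{unit.upper()}"

# Hypothetical label variants, for illustration only.
samples = ["1TB HDD", "1 TB", "1tb hdd 5400rpm", "500GB HDD", "Empty"]
print([normalize_capacity(s) for s in samples])
```

With pandas, this could be applied as `df['hdd'].map(normalize_capacity)` before calling `value_counts()`.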

Coming to solid state drive:

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.ssd)
# Creating and generating a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='GnBu', background_color="black").generate(text)
# Displaying the generated image on a single figure:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

So standard SSD models are getting sold the most. Purchases of NVMe & PCIe models are balanced. And it looks like 1TB & 2TB models are barely sold. Maybe that's because solid state drives are expensive. One surprising thing, though, is that a customer has purchased a 5TB SSD! That's something! (You can see that tiny mention of 5TB in the middle.) Meanwhile, one funny thing is that somebody purchased an 8GB SSD. Lol :D

## Pie Charts:
Now I'll visualize the data using pie-charts:

Plotting a pie-chart for the "Highest Selling Brands":

In [None]:
color = plt.cm.YlGnBu(np.linspace(0,1,20))
df["brand_name"].value_counts().head(20).sort_values(ascending=True).plot.pie(colors=color,autopct="%0.1f%%")
plt.title("Highest Selling Brands")
plt.axis("off")
plt.show()

Plotting a pie-chart for the "Most Selling GPUs":

In [None]:
color = plt.cm.RdYlBu(np.linspace(0,1,20))
df["gpu"].value_counts().head(20).sort_values(ascending=True).plot.pie(colors=color,autopct="%0.1f%%")
plt.title("Most Selling GPUs")
plt.axis("on")
plt.show()

So the Nvidia GTX series is sold the most. Moreover, the GTX 1050 is the highest-selling GPU in this dataset, followed by the GTX 1060.
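
Because the missing GPU values were filled with "Empty" earlier, that placeholder gets its own slice in the pie chart and shrinks every other share. Using only the counts from the profiling summary (254 rows, 21 missing GPUs), here's a rough sketch of how much the top shares move when the placeholder is left out:

```python
# Counts copied from the profiling summary; only the five most common GPUs are listed there.
gpu_counts = {
    "GTX 1050": 62,
    "GTX 1060": 45,
    "GTX 1070": 24,
    "RTX 2070": 22,
    "RTX 2060": 20,
}
total_rows = 254
missing_gpu = 21                        # rows that were filled with "Empty"
known_rows = total_rows - missing_gpu   # rows with a GPU actually listed

for gpu, count in gpu_counts.items():
    share_all = 100 * count / total_rows      # "Empty" counted as a slice
    share_known = 100 * count / known_rows    # "Empty" left out
    print(f"{gpu}: {share_all:.1f}% of all rows, {share_known:.1f}% of rows with a GPU listed")

# Combined share of the three GTX models above among rows with a GPU listed.
gtx_total = sum(count for gpu, count in gpu_counts.items() if gpu.startswith("GTX"))
gtx_share = 100 * gtx_total / known_rows
print(f"GTX 1050/1060/1070 combined: {gtx_share:.1f}%")
```

Either way the ranking stays the same; only the percentages printed on the slices shift.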

Let me now see which laptops in the dataset have GTX 1050 GPU:

In [None]:
gtx_105 = df[(df['gpu']=='GTX 1050')].reset_index(drop=True)
gtx_105.head(30)

Now I'll see which ones have GTX 1060 GPU:

In [None]:
gtx_106 = df[(df['gpu']=='GTX 1060')].reset_index(drop=True)
gtx_106.head(30)

I'll now see the **best rated laptops** of them all:

In [None]:
rating = df[(df['rating']=='Rating + 5')].reset_index(drop=True)
rating.head(30)

Finally, the lowest-rated laptops:

In [None]:
rating_low = df[(df['rating']=='Rating + 1')].reset_index(drop=True)
rating_low.head(30)

So MSI is the best-rated brand among these next-gen laptops, and Dell is the lowest-rated. ASUS shows up in both categories. Cool!
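
Scanning the first 30 rows by eye is a bit rough, so a per-brand average would back this up more firmly. Here's a hedged sketch on made-up rows: it assumes the rating strings always end in the star count (as in 'Rating + 5'), and the `stars` column is a name I'm introducing for illustration:

```python
import pandas as pd

# Hypothetical rows; the real dataset has 254 of these.
toy = pd.DataFrame({
    "brand_name": ["MSI", "MSI", "MSI", "Dell", "Dell", "ASUS", "ASUS"],
    "rating": ["Rating + 5", "Rating + 5", "Rating + 4",
               "Rating + 1", "Rating + 3", "Rating + 5", "Rating + 1"],
})

# Pull the trailing number out of strings like 'Rating + 5'.
toy["stars"] = toy["rating"].str.extract(r"(\d+)\s*$")[0].astype(int)

# Average stars and number of laptops per brand, best first.
summary = (
    toy.groupby("brand_name")["stars"]
       .agg(["mean", "count"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

Keeping the `count` column next to the mean helps avoid reading too much into a brand that only has a handful of rated laptops.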

As I've got a lot of interesting information from this dataset I think I should stop here now. Thank you for your time. Regards.
* Rachit Shukla