<h2>Data Exploration with R</h2>
<p data-start="105" data-end="167">In this lesson, we will explore datasets using <strong data-start="152" data-end="164">R blocks</strong>.</p>
<p data-start="174" data-end="355">First, select a dataset and choose <strong data-start="209" data-end="223">&ldquo;penguins&rdquo;</strong> from the dropdown menu.<br data-start="247" data-end="250">Next, open the <strong data-start="267" data-end="277">Filter</strong> category, drag the <strong data-start="297" data-end="307"><code data-start="299" data-end="305">head</code></strong> block, and attach it below the dataset block.</p>
<p data-start="362" data-end="429" data-is-last-node="">Run the program and observe the printed output to preview the data.</p>

In [1]:

# VISIBLE_CODE_START
penguins |>
head(x=_, n=6)
# VISIBLE_CODE_END

  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007

<p>The <code data-start="97" data-end="105">head()</code> function displays the first <em data-start="134" data-end="137">n</em> rows of the dataset (6 by default).<br data-start="158" data-end="161">This lets us see which columns the dataset contains and what kinds of values each column stores, giving an initial sense of the data structure. Row 4 also shows that some measurements are missing: R prints missing values as <code>NA</code>.</p>
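<p>As a minimal sketch of other ways to take a first look, the snippet below varies <code>n</code> and uses the base R functions <code>dim()</code>, <code>names()</code>, and <code>str()</code>. It assumes R 4.5.0 or later, where <code>penguins</code> ships with the built-in <code>datasets</code> package; the variable name <code>first_three</code> is only illustrative.</p>

```r
# Assumption: R >= 4.5.0, where `penguins` is part of the datasets package.
first_three <- penguins |>
  head(x = _, n = 3)        # n sets how many rows are returned
print(first_three)

print(dim(penguins))        # number of rows, then number of columns
print(names(penguins))      # column names only
str(penguins)               # type of each column plus a few example values
```

<p>Here <code>str()</code> is often the quickest way to tell numeric columns apart from categorical (factor) columns before plotting.</p>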

<h2>Plotting the Dataset</h2>
<p>Now that we have a basic understanding of the dataset&rsquo;s structure, let&rsquo;s create some visualizations.<br data-start="153" data-end="156">We&rsquo;ll start by making a <strong data-start="182" data-end="198">scatter plot</strong> using the <code data-start="209" data-end="217">ggplot</code> block together with the <code data-start="242" data-end="254">geom_point</code> block.</p>

In [2]:
library(ggplot2)

# VISIBLE_CODE_START
penguins |>
ggplot(data=_, aes(x=flipper_len, y=body_mass, color=species, fill=))  + geom_point(aes(color=, shape=))
# VISIBLE_CODE_END

<p data-start="68" data-end="188">This scatter plot shows the relationship between <strong data-start="117" data-end="135">flipper length</strong> and <strong data-start="140" data-end="153">body mass</strong> for the penguins in the dataset.</p>
<p data-start="195" data-end="336">Each point represents a single penguin. The position of a point indicates its flipper length on the x-axis and its body mass on the y-axis.</p>
<p data-start="343" data-end="611">The points are colored by <strong data-start="369" data-end="380">species</strong>, allowing us to visually compare how different penguin species cluster and differ in size. We can already see that Gentoo penguins tend to have longer flippers and higher body mass than the other two species, while Adelie and Chinstrap penguins overlap over much of their range.</p>
<p data-start="618" data-end="747" data-is-last-node="">This type of plot is useful for identifying patterns, groups, and possible relationships between numerical variables in the data.</p>
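<p>The block-generated code leaves unused slots such as <code>fill=</code> empty. Since the cell runs as shown, ggplot2 evidently tolerates them, but hand-written code would normally leave them out. The sketch below is one possible hand-written equivalent, assuming R 4.5.0 or later for the built-in <code>penguins</code> data. The axis labels are an addition (the dataset records flipper length in millimetres and body mass in grams), and <code>p</code> is just an illustrative name.</p>

```r
library(ggplot2)

# Assumption: R >= 4.5.0, where `penguins` is part of the datasets package.
# Same scatter plot as the block version, without the empty arguments.
p <- ggplot(penguins, aes(x = flipper_len, y = body_mass, color = species)) +
  geom_point() +
  labs(x = "Flipper length (mm)", y = "Body mass (g)", color = "Species")

print(p)  # rows with missing measurements are dropped with a warning
```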

<h2>Encoding Sex Using Point Shape</h2>
<p>By using the <code data-start="123" data-end="130">shape</code> argument, we can add more information to the previous plot by encoding the penguins&rsquo; <strong data-start="216" data-end="223">sex</strong> as the point shape.<br data-start="243" data-end="246">This allows us to distinguish between male and female penguins while still comparing body mass and flipper length across species. The cell also adds a <code>geom_smooth()</code> layer, which draws a trend line through the points.</p>

In [3]:
library(ggplot2)

# VISIBLE_CODE_START
penguins |>
ggplot(data=_, aes(x=flipper_len, y=body_mass, color=, fill=))  + geom_point(aes(color=species, shape=sex)) + geom_smooth(aes(color=, fill=), method=lm)
# VISIBLE_CODE_END

<h3>Splitting Plots with facet_wrap</h3>
<p>In this cell, we use <strong data-start="72" data-end="90"><code data-start="74" data-end="88">facet_wrap()</code></strong> to split a single plot into multiple panels.<br data-start="135" data-end="138">Because the facet formula is <code>~ island</code>, each panel shows the same relationship between flipper length and body mass, but for a different island, making it easier to compare patterns across islands. Species is still distinguished by color within each panel.</p>

In [4]:
library(ggplot2)

# VISIBLE_CODE_START
penguins |>
ggplot(data=_, aes(x=flipper_len, y=body_mass, color=, fill=))  + geom_point(aes(color=species, shape=sex)) + facet_wrap( ~ island)
# VISIBLE_CODE_END
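<p>To get one panel per species instead of one per island, the variable in the facet formula can be swapped. The following is a minimal sketch under the same assumptions as the cell above (ggplot2 installed, R 4.5.0 or later for the built-in <code>penguins</code> data); <code>by_species</code> is an illustrative name.</p>

```r
library(ggplot2)

# Assumption: R >= 4.5.0, where `penguins` is part of the datasets package.
by_species <- penguins |>
  ggplot(data = _, aes(x = flipper_len, y = body_mass)) +
  geom_point(aes(color = species, shape = sex)) +
  facet_wrap(~ species)   # one panel per species instead of per island

print(by_species)
```

<p>With this variant the color is redundant with the panels, so it could be mapped to another variable if desired.</p>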

<p data-start="237" data-end="374">In these plots, the <strong data-start="241" data-end="250">color</strong> of each point indicates the penguin&rsquo;s <strong data-start="289" data-end="300">species</strong>, while the <strong data-start="312" data-end="321">shape</strong> of the point represents the penguin&rsquo;s <strong data-start="364" data-end="371">sex</strong>.</p>
<p data-start="381" data-end="656">By using both color and shape, we can compare multiple variables at once. This makes it easier to see whether males and females differ in body mass or flipper length within the same species, and whether these differences follow similar or different patterns across species.</p>
<p data-start="663" data-end="777" data-is-last-node="">Adding multiple aesthetic mappings helps reveal more structure in the data without changing the underlying values.</p>
<p data-start="99" data-end="186">In the first plot of this section, <code data-start="115" data-end="130">geom_smooth()</code> overlays a <strong data-start="142" data-end="156">trend line</strong> on top of the scatter plot.</p>
<p data-start="193" data-end="463">This line represents a <strong data-start="216" data-end="232">linear model</strong> (<code data-start="234" data-end="247">method = lm</code>) that summarizes the overall relationship between <strong data-start="298" data-end="316">flipper length</strong> and <strong data-start="321" data-end="334">body mass</strong>. Instead of looking at individual points, the line helps us see the general direction and steepness of the relationship.</p>
<p data-start="470" data-end="675">When used alongside <code data-start="490" data-end="504">geom_point()</code>, the smooth line makes it easier to identify patterns, such as whether body mass tends to increase as flipper length increases, even when the data points are scattered.</p>
<p data-start="682" data-end="781" data-is-last-node="">In short, <code data-start="692" data-end="706">geom_point()</code> shows the raw data, while <code data-start="733" data-end="748">geom_smooth()</code> highlights the underlying trend.</p>
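<p>In the trend-line cell, <code>color = species</code> is set only inside <code>geom_point()</code>, so <code>geom_smooth()</code> fits a single line to all penguins. As a hedged sketch of an alternative, moving the color mapping into <code>ggplot()</code> makes every layer inherit it, which should yield one fitted line per species. It assumes R 4.5.0 or later for the built-in <code>penguins</code> data; <code>per_species</code> is an illustrative name, and <code>se = FALSE</code> simply hides the shaded confidence band.</p>

```r
library(ggplot2)

# Assumption: R >= 4.5.0, where `penguins` is part of the datasets package.
# color = species in ggplot() is inherited by both layers,
# so geom_smooth() fits a separate linear model for each species.
per_species <- ggplot(penguins, aes(x = flipper_len, y = body_mass, color = species)) +
  geom_point(aes(shape = sex)) +
  geom_smooth(method = lm, se = FALSE)

print(per_species)
```

<p>Comparing this with the single-line version shows whether the overall trend also holds within each species.</p>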

<h2>Basic Statistical Exploration</h2>
<p>In this cell, we use a&nbsp;<strong data-start="74" data-end="85">boxplot</strong> to compare the distribution of penguin <strong data-start="125" data-end="138">body mass</strong> between sexes.<br data-start="153" data-end="156">The plot summarizes differences in median, spread, and overall range for each group, with colors helping distinguish between male and female penguins.</p>

In [5]:

# VISIBLE_CODE_START
penguins |>
boxplot(body_mass ~ sex, data=_, col=heat.colors(2))
# VISIBLE_CODE_END
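<p>The numbers behind the boxes can also be read directly. The sketch below calls <code>boxplot()</code> with <code>plot = FALSE</code>, which returns the computed summaries instead of drawing them. It assumes R 4.5.0 or later for the built-in <code>penguins</code> data; the row labels and the names <code>bp</code> and <code>box_stats</code> are added here for readability.</p>

```r
# Assumption: R >= 4.5.0, where `penguins` is part of the datasets package.
# plot = FALSE returns the statistics a boxplot would draw, without plotting.
bp <- boxplot(body_mass ~ sex, data = penguins, plot = FALSE)

box_stats <- bp$stats            # one column per group, five rows per column
colnames(box_stats) <- bp$names  # group labels taken from the sex factor
rownames(box_stats) <- c("lower whisker", "lower hinge", "median",
                         "upper hinge", "upper whisker")

print(box_stats)
print(bp$n)                      # number of non-missing observations per group
```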

<h3>Summary Statistics</h3>
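<p>In this cell, we compute <strong data-start="81" data-end="103">summary statistics</strong> for every column of the penguins dataset.<br data-start="126" data-end="129">For numeric columns such as body mass, these include the minimum, maximum, median, mean, and quartiles; for categorical columns such as species, <code>summary()</code> reports the number of rows in each category. The <code>NA's</code> entries show how many values are missing. Together, these values give a numerical overview that helps us understand the distributions beyond visual plots.</p>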

In [6]:

# VISIBLE_CODE_START
penguins |>
summary(object=_)
# VISIBLE_CODE_END

      species          island       bill_len        bill_dep
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
                                 Mean   :43.92   Mean   :17.15
                                 3rd Qu.:48.50   3rd Qu.:18.70
                                 Max.   :59.60   Max.   :21.50
                                 NA's   :2       NA's   :2

  flipper_len      body_mass        sex           year
 Min.   :172.0   Min.   :2700   female:165   Min.   :2007
 1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007
 Median :197.0   Median :4050   NA's  : 11   Median :2008
 Mean   :200.9   Mean   :4202                Mean   :2008
 3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009
 Max.   :231.0   Max.   :6300                Max.   :2009
 NA's   :2       NA's   :2
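<p>To summarize a single column rather than the whole dataset, <code>summary()</code> can be applied to that column alone. A minimal sketch for body mass, assuming R 4.5.0 or later for the built-in <code>penguins</code> data (<code>mass_summary</code> is an illustrative name):</p>

```r
# Assumption: R >= 4.5.0, where `penguins` is part of the datasets package.
mass_summary <- penguins$body_mass |>
  summary(object = _)   # same function, applied to one numeric column

print(mass_summary)     # should match the body_mass column in the table above
```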

<p>We can also compute statistics for specific subsets of the data. For example, we can calculate the mean <strong data-start="144" data-end="157">body mass</strong> separately for each <strong data-start="178" data-end="185">sex</strong>:</p>

In [7]:

# VISIBLE_CODE_START
penguins |>
aggregate(data=_, body_mass ~ sex, 'mean')
# VISIBLE_CODE_END

     sex body_mass
1 female  3862.273
2   male  4545.685

<p data-start="78" data-end="352">The above cell calculates the <strong data-start="103" data-end="121">mean body mass</strong> for each sex in the dataset.<br data-start="150" data-end="153">The <code data-start="159" data-end="172">aggregate()</code> function groups the data by the variable on the right side of the formula (<code data-start="248" data-end="253">sex</code>) and then applies the selected function (<code data-start="295" data-end="301">mean</code>) to the variable on the left side (<code data-start="337" data-end="348">body_mass</code>). By default, the formula interface drops rows in which either variable is missing (<code>na.action = na.omit</code>), which is why the result has no <code>NA</code> group.</p>
<p data-start="359" data-end="594" data-is-last-node="">Unlike <code data-start="366" data-end="377">summary()</code>, which provides overall statistics for each variable across the entire dataset, <code data-start="462" data-end="475">aggregate()</code> computes statistics <strong data-start="496" data-end="525">separately for each group</strong>, allowing us to compare categories such as male and female penguins.</p>
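<p>The same pattern extends to more than one grouping variable. As a sketch, adding <code>species</code> to the right side of the formula should give one mean per combination of sex and species. It assumes R 4.5.0 or later for the built-in <code>penguins</code> data; <code>mass_by_group</code> is an illustrative name.</p>

```r
# Assumption: R >= 4.5.0, where `penguins` is part of the datasets package.
mass_by_group <- penguins |>
  aggregate(data = _, body_mass ~ sex + species, FUN = mean)

print(mass_by_group)   # one row per sex-and-species combination
```

<p>Swapping <code>mean</code> for another function, such as <code>median</code>, changes the statistic without changing the grouping.</p>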